Data Analysis with R
Table of Contents
Data Analysis with R
Credits
About the Author
About the Reviewer
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. RefresheR
Navigating the basics
Arithmetic and assignment
Logicals and characters
Flow of control
Getting help in R
Vectors
Subsetting
Vectorized functions
Advanced subsetting
Recycling
Functions
Matrices
Loading data into R
Working with packages
Exercises
Summary
2. The Shape of Data
Univariate data
Frequency distributions
Central tendency
Spread
Populations, samples, and estimation
Probability distributions
Visualization methods
Exercises
Summary
3. Describing Relationships
Multivariate data
Relationships between a categorical and a continuous variable
Relationships between two categorical variables
The relationship between two continuous variables
Covariance
Correlation coefficients
Comparing multiple correlations
Visualization methods
Categorical and continuous variables
Two categorical variables
Two continuous variables
More than two continuous variables
Exercises
Summary
4. Probability
Basic probability
A tale of two interpretations
Sampling from distributions
Parameters
The binomial distribution
The normal distribution
The three-sigma rule and using z-tables
Exercises
Summary
5. Using Data to Reason About the World
Estimating means
The sampling distribution
Interval estimation
How did we get 1.96?
Smaller samples
Exercises
Summary
6. Testing Hypotheses
Null Hypothesis Significance Testing
One and two-tailed tests
When things go wrong
A warning about significance
A warning about p-values
Testing the mean of one sample
Assumptions of the one sample t-test
Testing two means
Don't be fooled!
Assumptions of the independent samples t-test
Testing more than two means
Assumptions of ANOVA
Testing independence of proportions
What if my assumptions are unfounded?
Exercises
Summary
7. Bayesian Methods
The big idea behind Bayesian analysis
Choosing a prior
Who cares about coin flips
Enter MCMC – stage left
Using JAGS and runjags
Fitting distributions the Bayesian way
The Bayesian independent samples t-test
Exercises
Summary
8. Predicting Continuous Variables
Linear models
Simple linear regression
Simple linear regression with a binary predictor
A word of warning
Multiple regression
Regression with a non-binary predictor
Kitchen sink regression
The bias-variance trade-off
Cross-validation
Striking a balance
Linear regression diagnostics
Second Anscombe relationship
Third Anscombe relationship
Fourth Anscombe relationship
Advanced topics
Exercises
Summary
9. Predicting Categorical Variables
k-Nearest Neighbors
Using k-NN in R
Confusion matrices
Limitations of k-NN
Logistic regression
Using logistic regression in R
Decision trees
Random forests
Choosing a classifier
The vertical decision boundary
The diagonal decision boundary
The crescent decision boundary
The circular decision boundary
Exercises
Summary
10. Sources of Data
Relational Databases
Why didn't we just do that in SQL?
Using JSON
XML
Other data formats
Online repositories
Exercises
Summary
11. Dealing with Messy Data
Analysis with missing data
Visualizing missing data
Types of missing data
So which one is it?
Unsophisticated methods for dealing with missing data
Complete case analysis
Pairwise deletion
Mean substitution
Hot deck imputation
Regression imputation
Stochastic regression imputation
Multiple imputation
So how does mice come up with the imputed values?
Methods of imputation
Multiple imputation in practice
Analysis with unsanitized data
Checking for out-of-bounds data
Checking the data type of a column
Checking for unexpected categories
Checking for outliers, entry errors, or unlikely data points
Chaining assertions
Other messiness
OpenRefine
Regular expressions
tidyr
Exercises
Summary
12. Dealing with Large Data
Wait to optimize
Using a bigger and faster machine
Be smart about your code
Allocation of memory
Vectorization
Using optimized packages
Using another R implementation
Use parallelization
Getting started with parallel R
An example of (some) substance
Using Rcpp
Be smarter about your code
Exercises
Summary
13. Reproducibility and Best Practices
R Scripting
RStudio
Running R scripts
An example script
Scripting and reproducibility
R projects
Version control
Communicating results
Exercises
Summary
Index
Data Analysis with R
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: December 2015
Production reference: 1171215
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78528-814-2
www.packtpub.com
Credits
Author
Tony Fischetti
Reviewer
Dipanjan Sarkar
Commissioning Editor
Akram Hussain
Acquisition Editor
Meeta Rajani
Content Development Editor
Anish Dhurat
Technical Editor
Siddhesh Patil
Copy Editor
Sonia Mathur
Project Coordinator
Bijal Patel
Proofreader
Safis Editing
Indexer
Monica Ajmera Mehta
Graphics
Disha Haria
Production Coordinator
Conidon Miranda
Cover Work
Conidon Miranda
About the Author
Tony Fischetti is a data scientist at College Factual, where he gets to use R every day to build personalized rankings and recommender systems. He graduated in cognitive science from Rensselaer Polytechnic Institute, and his thesis was strongly focused on using statistics to study visual short-term memory.
Tony enjoys writing and contributing to open source software, blogging at http://www.onthelambda.com, writing about himself in third person, and sharing his knowledge using simple, approachable language and engaging examples.
The more traditionally exciting of his daily activities include listening to records, playing the guitar and bass (poorly), weight training, and helping others.
Because I'm aware of how incredibly lucky I am, it's really hard to express all the gratitude I have for everyone in my life that helped me—either directly, or indirectly—in completing this book. The following (partial) list is my best attempt at balancing thoroughness whilst also maximizing the number of people who will read this section by keeping it to a manageable length.
First, I'd like to thank all of my educators. In particular, I'd like to thank the Bronx High School of Science and Rensselaer Polytechnic Institute. More specifically, I'd like to thank the Bronx Science Robotics Team, all its members, its team moms, the wonderful Dena Ford and Cherrie Fleisher-Strauss; and Justin Fox. From the latter institution, I'd like to thank all of my professors and advisors. Shout out to Mike Kalsher, Michael Schoelles, Wayne Gray, Bram van Heuveln, Larry Reid, and Keith Anderson (especially Keith Anderson).
I'd like to thank the New York Public Library, Wikipedia, and other freely available educational resources. On a related note, I need to thank the R community and, more generally, all of the authors of R packages and other open source software I use for spending their own personal time to benefit humanity. Shout out to GNU, the R core team, and Hadley Wickham (who wrote a majority of the R packages I use daily).
Next, I'd like to thank the company I work for, College Factual, and all of my brilliant coworkers from whom I've learned so much.
I also need to thank my support network of millions, and my many, many friends that have all helped me more than they will likely ever realize.
I'd like to thank my partner, Bethany Wickham, who has been absolutely instrumental in providing much needed and appreciated emotional support during the writing of this book, and putting up with the mood swings that come along with working all day and writing all night.
Next, I'd like to express my gratitude for my sister, Andrea Fischetti, who means the world to me. Throughout my life, she's kept me warm and human in spite of the scientist in me that likes to get all reductionist and cerebral.
Finally, and most importantly, I'd like to thank my parents. This book is for my father, to whom I owe my love of learning and my interest in science and statistics; and for my mother, for her love and unwavering support, and to whom I owe my work ethic and ability to handle anything and tackle any challenge.
About the Reviewer
Dipanjan Sarkar is an IT engineer at Intel, the world's largest silicon company, where he works on analytics, business intelligence, and application development. He received his master's degree in information technology from the International Institute of Information Technology, Bangalore. Dipanjan's areas of specialization include software engineering, data science, machine learning, and text analytics.
His interests include learning about new technologies, disruptive start-ups, and data science. In his spare time, he loves reading, playing games, and watching popular sitcoms.
Dipanjan also reviewed Learning R for Geospatial Analysis and R Data Analysis Cookbook, both by Packt Publishing.
I would like to thank Bijal Patel, the project coordinator of this book, for making the reviewing experience really interactive and enjoyable.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
Preface
I'm going to shoot it to you straight: there are a lot of books about data analysis and the R programming language. I'll take it on faith that you already know why it's extremely helpful and fruitful to learn R and data analysis (if not, why are you reading this preface?!) but allow me to make a case for choosing this book to guide you in your journey.
For one, this subject didn't come naturally to me. There are those with an innate talent for grasping the intricacies of statistics the first time it is taught to them; I don't think I'm one of these people. I kept at it because I love science and research and knew that data analysis was necessary, not because it immediately made sense to me. Today, I love the subject in and of itself, rather than instrumentally, but this only came after months of heartache. Eventually, as I consumed resource after resource, the pieces of the puzzle started to come together. After this, I started tutoring all of my friends in the subject—and have seen them trip over the same obstacles that I had to learn to climb. I think that coming from this background gives me a unique perspective on the plight of the statistics student and allows me to reach them in a way that others may not be able to. By the way, don't let the fact that statistics used to baffle me scare you; I have it on fairly good authority that I know what I'm talking about today.
Secondly, this book was born of the frustration that most statistics texts tend to be written in the driest manner possible. In contrast, I adopt a light-hearted buoyant approach—but without becoming agonizingly flippant.
Third, this book includes a lot of material that I wished were covered in more of the resources I used when I was learning about data analysis in R. For example, the entire last unit specifically covers topics that present enormous challenges to R analysts when they first go out to apply their knowledge to imperfect real-world data.
Lastly, I thought long and hard about how to lay out this book and which order of topics was optimal. And when I say long and hard I mean I wrote a library and designed algorithms to do this. The order in which I present the topics in this book was very carefully considered to (a) build on top of each other, (b) follow a reasonable level of difficulty progression allowing for periodic chapters of relatively simpler material (psychologists call this intermittent reinforcement), (c) group highly related topics together, and (d) minimize the number of topics that require knowledge of yet unlearned topics (this is, unfortunately, common in statistics). If you're interested, I detail this procedure in a blog post that you can read at http://bit.ly/teach-stats.
The point is that the book you're holding is a very special one—one that I poured my soul into. Nevertheless, data analysis can be a notoriously difficult subject, and there may be times where nothing seems to make sense. During these times, remember that many others (including myself) have felt stuck, too. Persevere… the reward is great. And remember, if a blockhead like me can do it, you can, too. Go you!
What this book covers
Chapter 1, RefresheR, reviews the aspects of R that subsequent chapters will assume knowledge of. Here, we learn the basics of R syntax, learn R's major data structures, write functions, load data and install packages.
Chapter 2, The Shape of Data, discusses univariate data. We learn about different data types, how to describe univariate data, and how to visualize the shape of these data.
Chapter 3, Describing Relationships, goes on to the subject of multivariate data. In particular, we learn about the three main classes of bivariate relationships and learn how to describe them.
Chapter 4, Probability, kicks off a new unit by laying foundation. We learn about basic probability theory, Bayes' theorem, and probability distributions.
Chapter 5, Using Data to Reason About the World, discusses sampling and estimation theory. Through examples, we learn of the central limit theorem, point estimation and confidence intervals.
Chapter 6, Testing Hypotheses, introduces the subject of Null Hypothesis Significance Testing (NHST). We learn many popular hypothesis tests and their non-parametric alternatives. Most importantly, we gain a thorough understanding of the misconceptions and gotchas of NHST.
Chapter 7, Bayesian Methods, introduces an alternative to NHST based on a more intuitive view of probability. We learn the advantages and drawbacks of this approach, too.
Chapter 8, Predicting Continuous Variables, thoroughly discusses linear regression. Before the chapter's conclusion, we learn all about the technique, when to use it, and what traps to look out for.
Chapter 9, Predicting Categorical Variables, introduces four of the most popular classification techniques. By using all four on the same examples, we gain an appreciation for what makes each technique shine.
Chapter 10, Sources of Data, is all about how to use different data sources in R. In particular, we learn how to interface with databases, and request and load JSON and XML via an engaging example.
Chapter 11, Dealing with Messy Data, introduces some of the snags of working with less than perfect data in practice. The bulk of this chapter is dedicated to missing data, imputation, and identifying and testing for messy data.
Chapter 12, Dealing with Large Data, discusses some of the techniques that can be used to cope with data sets that are larger than can be handled swiftly without a little planning. The key components of this chapter are on parallelization and Rcpp.
Chapter 13, Reproducibility and Best Practices, closes with the extremely important (but often ignored) topic of how to use R like a professional. This includes learning about tooling, organization, and reproducibility.
What you need for this book
All code in this book has been written against the latest version of R—3.2.2 at the time of writing. As a matter of good practice, you should keep your R version up to date but most, if not all, code should work with any reasonably recent version of R. Some of the R packages we will be installing will require more recent versions, though. For the other software that this book uses, instructions will be furnished pro re nata. If you want to get a head start, however, install RStudio, JAGS, and a C++ compiler (or Rtools if you use Windows).
Who this book is for
Whether you are learning data analysis for the first time, or you want to deepen the understanding you already have, this book will prove to be an invaluable resource. If you are looking for a book to bring you all the way through the fundamentals to the application of advanced and effective analytics methodologies, and have some prior programming experience and a mathematical background, then this is for you.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We will use the system.time function to time the execution."
A block of code is set as follows:
library(VIM)
aggr(miss_mtcars, numbers=TRUE)
Any command-line input or output is written as follows:
# R --vanilla CMD BATCH nothing.R
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Clicking the Next button moves you to the next screen."
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/Data_Analysis_With_R_ColorImages.pd
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
Questions
If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.
Chapter 1. RefresheR
Before we dive into the (other) fun stuff (sampling multi-dimensional probability distributions, using convex optimization to fit data models, and so on), it would be helpful if we review those aspects of R that all subsequent chapters will assume knowledge of.
If you fancy yourself as an R guru, you should still, at least, skim through this chapter, because you'll almost certainly find the idioms, packages, and style introduced here to be beneficial in following along with the rest of the material.
If you don't care much about R (yet), and are just in this for the statistics, you can heave a heavy sigh of relief that, for the most part, you can run the code given in this book in the interactive R interpreter with very little modification, and just follow along with the ideas. However, it is my belief (read: delusion) that by the end of this book, you'll cultivate a newfound appreciation of R alongside a robust understanding of methods in data analysis.
Fire up your R interpreter, and let's get started!
Navigating the basics
In the interactive R interpreter, any line starting with a > character denotes R asking for input (if you see a + prompt, it means that you didn't finish typing a statement at the prompt and R is asking you to provide the rest of the expression). Striking the return key will send your input to R to be evaluated. R's response is then spit back at you in the line immediately following your input, after which R asks for more input. This is called a REPL (Read-Evaluate-Print-Loop). It is also possible for R to read a batch of commands saved in a file (unsurprisingly called batch mode), but we'll be using the interactive mode for most of the book.
As you might imagine, R supports all the familiar mathematical operators found in most other languages:
Arithmetic and assignment
Check out the following example:
> 2 + 2
[1] 4
> 9 / 3
[1] 3
> 5 %% 2      # modulus operator (remainder of 5 divided by 2)
[1] 1
Anything that occurs after the octothorpe or pound sign, # (or hash-tag for you young'uns), is ignored by the R interpreter. This is useful for documenting the code in natural language. These are called comments.
In a multi-operation arithmetic expression, R will follow the standard order of operations from math. In order to override this natural order, you have to use parentheses flanking the sub-expression that you'd like to be performed first.
> 3 + 2 - 10 ^ 2      # ^ is the exponent operator
[1] -95
> 3 + (2 - 10) ^ 2
[1] 67
In practice, almost all compound expressions are split up with intermediate values assigned to variables which, when used in future expressions, are just like substituting the variable with the value that was assigned to it. The (primary) assignment operator is <-.
> # assignments follow the form VARIABLE <- VALUE
> var <- 10
> var
[1] 10
> var ^ 2
[1] 100
> VAR / 2      # variable names are case-sensitive
Error: object 'VAR' not found
Notice that the first and second lines in the preceding code snippet didn't have an output to be displayed, so R just immediately asked for more input. This is because assignments don't have a return value. Their only job is to give a value to a variable, or to change the existing value of a variable. Generally, operations and functions on variables in R don't change the value of the variable. Instead, they return the result of the operation. If you want to change a variable to the result of an operation using that variable, you have to reassign that variable as follows:
> var              # var is 10
[1] 10
> var ^ 2
[1] 100
> var              # var is still 10
[1] 10
> var <- var ^ 2   # no return value
> var              # var is now 100
[1] 100
Be aware that variable names may contain numbers, underscores, and periods; this is something that trips up a lot of people who are familiar with other programming languages that disallow using periods in variable names. The only further restrictions on variable names are that they must start with a letter (or a period and then a letter), and that they must not be one of the reserved words in R such as TRUE, Inf, and so on.
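To make these naming rules concrete, here is a small set of assignments (the names themselves are invented for illustration):

```r
my.var_2 <- 1         # periods, underscores, and digits are all allowed
.hidden.value <- 2    # a leading period is legal (and hides the name from ls())
print(my.var_2 + .hidden.value)
# [1] 3
# By contrast, a name starting with a digit (2var) or a reserved
# word (TRUE) cannot be used on the left side of an assignment.
```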
Although the arithmetic operators that we've seen thus far are functions in their own right, most functions in R take the form: function_name(value(s) supplied to the function). The values supplied to the function are called arguments of that function.
> cos(3.14159)        # cosine function
[1] -1
> cos(pi)             # pi is a constant that R provides
[1] -1
> acos(-1)            # arccosine function
[1] 3.141593
> acos(cos(pi)) + 10
[1] 13.14159
> # functions can be used as arguments to other functions
(If you paid attention in math class, you'll know that the cosine of π is -1, and that arccosine is the inverse function of cosine.)
There are hundreds of such useful functions defined in base R, only a handful of which we will see in this book. Two sections from now, we will be building our very own functions.
Before we move on from arithmetic, it will serve us well to visit some of the odd values that may result from certain operations:
> 1 / 0
[1] Inf
> 0 / 0
[1] NaN
It is common during practical usage of R to accidentally divide by zero. As you can see, this undefined operation yields an infinite value in R. Dividing zero by zero yields the value NaN, which stands for Not a Number.
Logicals and characters
So far, we've only been dealing with numerics, but there are other atomic data types in R. To wit:
> foo <- TRUE        # foo is of the logical data type
> class(foo)         # class() tells us the type
[1] "logical"
> bar <- "hi!"       # bar is of the character data type
> class(bar)
[1] "character"
The logical data type (also called Booleans) can hold the values TRUE or FALSE or, equivalently, T or F. The familiar operators from Boolean algebra are defined for these types:
> foo
[1] TRUE
> foo && TRUE        # Boolean and
[1] TRUE
> foo && FALSE
[1] FALSE
> foo || FALSE       # Boolean or
[1] TRUE
> !foo               # negation operator
[1] FALSE
In a Boolean expression with a logical value and a number, any number that is not 0 is interpreted as TRUE.
> foo && 1
[1] TRUE
> foo && 2
[1] TRUE
> foo && 0
[1] FALSE
Additionally, there are functions and operators that return logical values such as:
> 4 < 2              # less than operator
[1] FALSE
> 4 >= 4             # greater than or equal to
[1] TRUE
> 3 == 3             # equality operator
[1] TRUE
> 3 != 2             # inequality operator
[1] TRUE
Just as there are functions in R that are only defined for work on the numeric and logical data types, there are other functions that are designed to work only with the character data type, also known as strings:
> lang.domain <- "statistics"
> lang.domain <- toupper(lang.domain)
> print(lang.domain)
[1] "STATISTICS"
> # retrieves substring from first character to fourth character
> substr(lang.domain, 1, 4)
[1] "STAT"
> gsub("I", "1", lang.domain)    # substitutes every "I" for "1"
[1] "STAT1ST1CS"
> # combines character strings
> paste("R does", lang.domain, "!!!")
[1] "R does STATISTICS !!!"
Flow of control
The last topic in this section will be flow of control constructs.
The most basic flow of control construct is the if statement. The argument to an if statement (what goes between the parentheses) is an expression that returns a logical value. The block of code following the if statement gets executed only if the expression yields TRUE. For example:
> if(2 + 2 == 4)
+   print("very good")
[1] "very good"
> if(2 + 2 == 5)
+   print("all hail to the thief")
>
It is possible to execute more than one statement if an if condition is triggered; you just have to use curly brackets ({}) to contain the statements.
> if((4 / 2 == 2) && (2 * 2 == 4)){
+   print("four divided by two is two…")
+   print("and two times two is four")
+ }
[1] "four divided by two is two…"
[1] "and two times two is four"
>
It is also possible to specify a block of code that will get executed if the if conditional is FALSE.
> closing.time <- TRUE
> if(closing.time){
+   print("you don't have to go home")
+   print("but you can't stay here")
+ } else{
+   print("you can stay here!")
+ }
[1] "you don't have to go home"
[1] "but you can't stay here"
> if(!closing.time){
+   print("you don't have to go home")
+   print("but you can't stay here")
+ } else{
+   print("you can stay here!")
+ }
[1] "you can stay here!"
>
There are other flow of control constructs (like while and for), but we won't directly be using them much in this text.
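Though we won't lean on them much, a minimal taste of for and while may help you recognize them in other people's code (the particular loops here are my own toy examples):

```r
# for executes its body once per element of the vector it iterates over
squares <- c()
for (i in 1:4) {
  squares <- c(squares, i^2)
}
print(squares)
# [1]  1  4  9 16

# while repeats its body for as long as its condition yields TRUE
countdown <- 3
while (countdown > 0) {
  countdown <- countdown - 1
}
print(countdown)
# [1] 0
```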
Getting help in R
Before we go further, it would serve us well to have a brief section detailing how to get help in R. Most R tutorials leave this for one of the last sections—if it is even included at all! In my own personal experience, though, getting help is going to be one of the first things you will want to do as you add more bricks to your R knowledge castle. Learning R doesn't have to be difficult; just take it slowly, ask questions, and get help early. Go you!
It is easy to get help with R right at the console. Running the help.start() function at the prompt will start a manual browser. From here, you can do anything from going over the basics of R to reading the nitty-gritty details on how R works internally.
You can get help on a particular function in R if you know its name, by supplying that name as an argument to the help function. For example, let's say you want to know more about the gsub() function that I sprang on you before. Running the following code:
> help("gsub")
> # or simply
> ?gsub
will display a manual page documenting what the function is, how to use it, and examples of its usage.
This rapid accessibility to documentation means that I'm never hopelessly lost when I encounter a function which I haven't seen before. The downside to this extraordinarily convenient help mechanism is that I rarely bother to remember the order of arguments, since looking them up is just seconds away.
Occasionally, you won't quite remember the exact name of the function you're looking for, but you'll have an idea about what the name should be. For this, you can use the help.search() function.
> help.search("chisquare")
> # or simply
> ??chisquare
For tougher, more semantic queries, nothing beats a good old fashioned web search engine. If you don't get relevant results the first time, try adding the term programming or statistics in there for good measure.
Vectors
Vectors are the most basic data structures in R, and they are ubiquitous indeed. In fact, even the single values that we've been working with thus far were actually vectors of length 1. That's why the interactive R console has been printing [1] along with all of our output.
Vectors are essentially an ordered collection of values of the same atomic data type. Vectors can be arbitrarily large (with some limitations), or they can be just one single value.
The canonical way of building vectors manually is by using the c() function (which stands for combine).
> our.vect <- c(8, 6, 7, 5, 3, 0, 9)
> our.vect
[1] 8 6 7 5 3 0 9
In the preceding example, we created a numeric vector of length 7 (namely, Jenny's telephone number).
Note that if we tried to put character data types into this vector as follows:
> another.vect <- c("8", 6, 7, "-", 3, "0", 9)
> another.vect
[1] "8" "6" "7" "-" "3" "0" "9"
R would convert all the items in the vector (called elements) into character data types to satisfy the condition that all elements of a vector must be of the same type. A similar thing happens when you try to use logical values in a vector with numbers; the logical values would be converted into 1 and 0 (for TRUE and FALSE, respectively). These logicals will turn into "TRUE" and "FALSE" (note the quotation marks) when used in a vector that contains characters.
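To see this coercion hierarchy in action (character beats numeric beats logical), consider this small, made-up example:

```r
# logicals mixed with numbers become 1s and 0s...
mixed.num <- c(TRUE, FALSE, 42)
print(mixed.num)
# [1]  1  0 42

# ...but anything mixed with characters becomes a character string
mixed.char <- c(TRUE, 42, "0")
print(mixed.char)
# [1] "TRUE" "42"   "0"
```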
Subsetting
It is very common to want to extract one or more elements from a vector. For this, we use a technique called indexing or subsetting. After the vector, we put an integer in square brackets ([]) called the subscript operator. This instructs R to return the element at that index. The indices (plural for index, in case you were wondering!) for vectors in R start at 1, and stop at the length of the vector.
> our.vect[1]                 # to get the first value
[1] 8
> # the function length() returns the length of a vector
> length(our.vect)
[1] 7
> our.vect[length(our.vect)]  # get the last element of a vector
[1] 9
Note that in the preceding code, we used a function in the subscript operator. In cases like these, R evaluates the expression in the subscript operator, and uses the number it returns as the index to extract.
If we get greedy, and try to extract an element at an index that doesn't exist, R will respond with NA, meaning, not available. We see this special value cropping up from time to time throughout this text.
> our.vect[10]
[1] NA
One of the most powerful ideas in R is that you can use vectors to subset other vectors:
> # extract the first, third, fifth, and
> # seventh element from our vector
> our.vect[c(1, 3, 5, 7)]
[1] 8 7 3 9
The ability to use vectors to index other vectors may not seem like much now, but its usefulness will become clear soon.
Another way to create vectors is by using sequences.
> other.vector <- 1:10
> other.vector
[1]  1  2  3  4  5  6  7  8  9 10
> another.vector <- seq(50, 30, by=-2)
> another.vector
[1] 50 48 46 44 42 40 38 36 34 32 30
Above, the 1:10 statement creates a vector from 1 to 10. 10:1 would have created the same 10 element vector, but in reverse. The seq() function is more general in that it allows sequences to be made using steps (among many other things).
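For instance, the reversed sequence and a couple of seq()'s other capabilities look like this (these particular calls are illustrative additions, not from the text above):

```r
print(10:1)
# [1] 10  9  8  7  6  5  4  3  2  1
print(seq(1, 10, by=3))         # steps need not be -2, or even negative
# [1]  1  4  7 10
print(seq(0, 1, length.out=5))  # seq can also split a range into equal parts
# [1] 0.00 0.25 0.50 0.75 1.00
```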
Combining our knowledge of sequences and of subsetting vectors with vectors, we can get the first 5 digits of Jenny's number thusly:
> our.vect[1:5]
[1] 8 6 7 5 3
Vectorized functions
Part of what makes R so powerful is that many of R's functions take vectors as arguments. These vectorized functions are usually extremely fast and efficient. We've already seen one such function, length(), but there are many, many others.
> # takes the mean of a vector
> mean(our.vect)
[1] 5.428571
> sd(our.vect)       # standard deviation
[1] 3.101459
> min(our.vect)
[1] 0
> max(1:10)
[1] 10
> sum(c(1, 2, 3))
[1] 6
In practical settings, such as when reading data from files, it is common to have NA values in vectors:
> messy.vector <- c(8, 6, NA, 7, 5, NA, 3, 0, 9)
> messy.vector
[1]  8  6 NA  7  5 NA  3  0  9
> length(messy.vector)
[1] 9
Some vectorized functions will not allow NA values by default. In these cases, an extra keyword argument must be supplied along with the first argument to the function.
> mean(messy.vector)
[1] NA
> mean(messy.vector, na.rm=TRUE)
[1] 5.428571
> sum(messy.vector, na.rm=FALSE)
[1] NA
> sum(messy.vector, na.rm=TRUE)
[1] 38
As mentioned previously, vectors can be constructed from logical values too.
> log.vector <- c(TRUE, TRUE, FALSE)
> log.vector
[1]  TRUE  TRUE FALSE
Since logical values can be coerced into behaving like numerics, as we saw earlier, if we try to sum a logical vector as follows:
> sum(log.vector)
[1] 2
we will, essentially, get a count of the number of TRUE values in that vector.
There are many functions in R which operate on vectors and return logical vectors. is.na() is one such function. It returns a logical vector—that is, one of the same length as the vector supplied as an argument—with a TRUE in the position of every NA value. Remember our messy vector (from just a minute ago)?
> messy.vector
[1]  8  6 NA  7  5 NA  3  0  9
> is.na(messy.vector)
[1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
> #    8     6    NA     7     5    NA     3     0     9
Putting together these pieces of information, we can get a count of the number of NA values in a vector as follows:
> sum(is.na(messy.vector))
[1] 2
When you use Boolean operators on vectors, they also return logical vectors of the same length as the vector being operated on.
> our.vect > 5
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
If we wanted to—and we do—count the number of digits in Jenny's phone number that are greater than five, we would do so in the following manner:
> sum(our.vect > 5)
[1] 4
Advanced subsetting
Did I mention that we can use vectors to subset other vectors? When we subset vectors
using logical vectors of the same length, only the elements corresponding to the TRUE
values are extracted. Hopefully, sparks are starting to go off in your head. If we wanted to
extract only the legitimate non-NA digits from Jenny's number, we can do it as follows:
> messy.vector[!is.na(messy.vector)]
[1] 8 6 7 5 3 0 9
This is a very critical trait of R, so let's take our time understanding it; this idiom will
come up again and again throughout this book.
The logical vector that yields TRUE when an NA value occurs in messy.vector (from
is.na()) is then negated (the whole thing) by the negation operator, !. The resultant vector
is TRUE whenever the corresponding value in messy.vector is not NA. When this logical
vector is used to subset the original messy vector, it extracts only the non-NA values from
it.
Similarly, we can show all the digits in Jenny's phone number that are greater than five as
follows:
> our.vect[our.vect > 5]
[1] 8 6 7 9
Thus far, we've only been displaying elements that have been extracted from a vector.
However, just as we've been assigning and re-assigning variables, we can assign values to
various indices of a vector, and change the vector as a result. For example, if Jenny tells us
that we have the first digit of her phone number wrong (it's really 9), we can reassign just
that element without modifying the others.
> our.vect
[1] 8 6 7 5 3 0 9
> our.vect[1] <- 9
> our.vect
[1] 9 6 7 5 3 0 9
Sometimes, it may be required to replace all the NA values in a vector with the value 0. To
do that with our messy vector, we can execute the following command:
> messy.vector[is.na(messy.vector)] <- 0
> messy.vector
[1] 8 6 0 7 5 0 3 0 9
Elegant though the preceding solution is, modifying a vector in place is usually
discouraged in favor of creating a copy of the original vector and modifying the copy. One
such technique for performing this is by using the ifelse() function.
Not to be confused with the if/else control construct, ifelse() is a function that takes three
arguments: a test that returns a logical/Boolean vector, a value to use for each element that
passes the test, and one to use for each element that fails the test.
The preceding in-place modification solution could be re-implemented with ifelse as
follows:
> ifelse(is.na(messy.vector), 0, messy.vector)
[1] 8 6 0 7 5 0 3 0 9
Recycling
The last important property of vectors and vector operations in R is that they can be
recycled. To understand what I mean, examine the following expression:
> our.vect + 3
[1] 12  9 10  8  6  3 12
This expression adds three to each digit in Jenny's phone number. Although it may look
so, R is not performing this operation between a vector and a single value. Remember
when I said that single values are actually vectors of length 1? What is really
happening here is that R is told to perform element-wise addition on a vector of length 7
and a vector of length 1. Since element-wise addition is not defined for vectors of differing
lengths, R recycles the smaller vector until it reaches the same length as that of the bigger
vector. Once both the vectors are the same size, R performs the addition, element by
element, and returns the result.
> our.vect + 3
[1] 12  9 10  8  6  3 12
is tantamount to…
> our.vect + c(3, 3, 3, 3, 3, 3, 3)
[1] 12  9 10  8  6  3 12
If we wanted to extract every other digit from Jenny's phone number, we can do so in the
following manner:
> our.vect[c(TRUE, FALSE)]
[1] 9 7 3 9
This works because the vector c(TRUE, FALSE) is repeated until it is of length 7,
making it equivalent to the following:
> our.vect[c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)]
[1] 9 7 3 9
One common snag related to vector recycling that R users (useRs, if I may) encounter is
that during some arithmetic operations involving vectors of discrepant lengths, R will warn
you if the smaller vector cannot be repeated a whole number of times to reach the length
of the bigger vector. This is not a problem when doing vector arithmetic with single
values, since 1 can be repeated any number of times to match the length of any vector
(which must, of course, be an integer). It would pose a problem, though, if we were
looking to add three to every other element in Jenny's phone number.
> our.vect + c(3, 0)
[1] 12  6 10  5  6  0 12
Warning message:
In our.vect + c(3, 0) :
  longer object length is not a multiple of shorter object length
You will likely learn to love these warnings, as they have stopped many useRs from
making grave errors.
Before we move on to the next section, an important thing to note is that in a lot of other
programming languages, many of the things that we did would have been implemented
using for loops and other control structures. Although there is certainly a place for loops
and such in R, oftentimes a more sophisticated solution exists using just vector/matrix
operations. In addition to elegance and brevity, the solution that exploits vectorization and
recycling is often many, many times more efficient.
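As a rough illustration of that efficiency claim, here is a sketch (timings will vary by machine; the function name is my own) comparing a loop-based sum of squares with its vectorized equivalent:

```r
# loop-based sum of squares, the way many other languages would do it
slow.sum.sq <- function(x) {
  total <- 0
  for (element in x) {
    total <- total + element^2
  }
  total
}

big.vector <- 1:1e7

# system.time() reports how long each expression takes to evaluate
system.time(slow.sum.sq(big.vector))  # the explicit loop
system.time(sum(big.vector^2))        # vectorized: typically far faster
```

On most machines, the vectorized version finishes in a small fraction of the loop's time, even though both compute the same number.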
Functions
If we need to perform some computation that isn't already a function in R multiple
times, we usually do so by defining our own functions. A custom function in R
is defined using the following syntax:
function.name <- function(argument1, argument2, ...){
  # some functionality
}
For example, if we wanted to write a function that determined if a number supplied as an
argument was even, we can do so in the following manner:
> is.even <- function(a.number){
+   remainder <- a.number %% 2
+   if(remainder==0)
+     return(TRUE)
+   return(FALSE)
+ }
>
> # testing it
> is.even(10)
[1] TRUE
> is.even(9)
[1] FALSE
As an example of a function that takes more than one argument, let's generalize the
preceding function by creating a function that determines whether the first argument is
divisible by its second argument.
> is.divisible.by <- function(large.number, smaller.number){
+   if(large.number %% smaller.number != 0)
+     return(FALSE)
+   return(TRUE)
+ }
>
> # testing it
> is.divisible.by(10, 2)
[1] TRUE
> is.divisible.by(10, 3)
[1] FALSE
> is.divisible.by(9, 3)
[1] TRUE
Our function, is.even(), could now be rewritten simply as:
> is.even <- function(num){
+   is.divisible.by(num, 2)
+ }
It is very common in R to want to apply a particular function to every element of a vector.
Instead of using a loop to iterate over the elements of a vector, as we would do in many
other languages, we use a function called sapply() to perform this. sapply() takes a
vector and a function as its arguments. It then applies the function to every element and
returns a vector of results. We can use sapply() in this manner to find out which digits in
Jenny's phone number are even:
> sapply(our.vect, is.even)
[1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
This worked great because sapply takes each element, and uses it as the argument in
is.even(), which takes only one argument. If you wanted to find the digits that are
divisible by three, it would require a little bit more work.
One option is just to define a function is.divisible.by.three() that takes only one
argument, and use that in sapply. The more common solution, however, is to define an
unnamed function that does just that in the body of the sapply function call:
> sapply(our.vect, function(num){is.divisible.by(num, 3)})
[1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
Here, we essentially created a function that checks whether its argument is divisible by
three, except we don't assign it to a variable, and use it directly in the sapply body
instead. These one-time-use unnamed functions are called anonymous functions or lambda
functions. (The name comes from Alonzo Church's invention of the lambda calculus, if
you were wondering.)
This is somewhat of an advanced usage of R, but it is very useful as it comes up very often
in practice.
If we wanted to extract the digits in Jenny's phone number that are divisible by both two
and three, we can write it as follows:
> where.even <- sapply(our.vect, is.even)
> where.div.3 <- sapply(our.vect, function(num){
+   is.divisible.by(num, 3)})
> # "&" is like the "&&" and operator, but for vectors
> our.vect[where.even & where.div.3]
[1] 6 0
Neat-o!
Note that if we wanted to be sticklers, we would have a clause in the function bodies to
preclude a modulus computation where the first number was smaller than the second. If
we had, our functions would not have erroneously indicated that 0 was divisible by two and
three. I'm not a stickler, though, so the functions will remain as is. Fixing this function is
left as an exercise for the (stickler) reader.
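For the sticklers who want a head start on that exercise, one possible fix, as a sketch (the function name is my own), guards against a first argument smaller than the second:

```r
# a guarded version of the divisibility check
is.divisible.by.strict <- function(large.number, smaller.number){
  # refuse the modulus computation when the first number is smaller
  if (large.number < smaller.number)
    return(FALSE)
  large.number %% smaller.number == 0
}

is.divisible.by.strict(0, 2)   # FALSE, where the original returned TRUE
is.divisible.by.strict(10, 2)  # TRUE, as before
```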
Matrices
In addition to the vector data structure, R has the matrix, data frame, list, and array data
structures. Though we will be using all these types (except arrays) in this book, we only
need to review the first two in this chapter.
A matrix in R, like in math, is a rectangular array of values (of one type) arranged in rows
and columns, and can be manipulated as a whole. Operations on matrices are fundamental
to data analysis.
One way of creating a matrix is to just supply a vector to the function matrix().
> a.matrix <- matrix(c(1, 2, 3, 4, 5, 6))
> a.matrix
     [,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    4
[5,]    5
[6,]    6
This produces a matrix with all the supplied values in a single column. We can make a
similar matrix with two columns by supplying matrix() with an optional argument, ncol,
that specifies the number of columns.
> a.matrix <- matrix(c(1, 2, 3, 4, 5, 6), ncol=2)
> a.matrix
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
We could have produced the same matrix by binding two vectors, c(1, 2, 3) and c(4,
5, 6), by columns using the cbind() function as follows:
> a2.matrix <- cbind(c(1, 2, 3), c(4, 5, 6))
We could create the transposition of this matrix (where rows and columns are switched) by
binding those vectors by row instead:
> a3.matrix <- rbind(c(1, 2, 3), c(4, 5, 6))
> a3.matrix
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
or by just using the matrix transposition function in R, t().
> t(a2.matrix)
Some other functions that operate on whole matrices are rowSums()/colSums() and
rowMeans()/colMeans().
> a2.matrix
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> colSums(a2.matrix)
[1]  6 15
> rowMeans(a2.matrix)
[1] 2.5 3.5 4.5
If vectors have sapply(), then matrices have apply(). The preceding two functions could
have been written, more verbosely, as:
> apply(a2.matrix, 2, sum)
[1]  6 15
> apply(a2.matrix, 1, mean)
[1] 2.5 3.5 4.5
where 1 instructs R to perform the supplied function over its rows, and 2, over its
columns.
The matrix multiplication operator in R is %*%.
> a2.matrix %*% a2.matrix
Error in a2.matrix %*% a2.matrix : non-conformable arguments
Remember, matrix multiplication is only defined for matrices where the number of
columns in the first matrix is equal to the number of rows in the second.
> a2.matrix
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> a3.matrix
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
> a2.matrix %*% a3.matrix
     [,1] [,2] [,3]
[1,]   17   22   27
[2,]   22   29   36
[3,]   27   36   45
>
> # dim() tells us how many rows and columns
> # (respectively) there are in the given matrix
> dim(a2.matrix)
[1] 3 2
To index the element of a matrix at the second row and first column, you need to supply
both of these numbers into the subscripting operator.
> a2.matrix[2,1]
[1] 2
Many useRs get confused and forget the order in which the indices must appear;
remember, it's rows first, then columns!
If you leave one of the spaces empty, R will assume you want that whole dimension:
> # returns the whole second column
> a2.matrix[,2]
[1] 4 5 6
> # returns the first row
> a2.matrix[1,]
[1] 1 4
And, as always, we can use vectors in our subscript operator:
> # give me the elements in column 2 at the first and third rows
> a2.matrix[c(1, 3), 2]
[1] 4 6
Loading data into R
Thus far, we've only been entering data directly into the interactive R console. For any
dataset of non-trivial size this is, obviously, an intractable solution. Fortunately for us, R
has a robust suite of functions for reading data directly from external files.
Go ahead, and create a file on your hard disk called favorites.txt that looks like this:
flavor,number
pistachio,6
mint chocolate chip,7
vanilla,5
chocolate,10
strawberry,2
neopolitan,4
This data represents the number of students in a class that prefer a particular flavor of soy
ice cream. We can read the file into a variable called favs as follows:
> favs <- read.table("favorites.txt", sep=",", header=TRUE)
If you get an error that there is no such file or directory, give R the full path name to your
dataset or, alternatively, run the following command:
> favs <- read.table(file.choose(), sep=",", header=TRUE)
The preceding command brings up an open file dialog for letting you navigate to the file
you've just created.
The argument sep="," tells R that each data element in a row is separated by a comma.
Other common data formats have values separated by tabs and pipes ("|"). The value of
sep should then be "\t" and "|", respectively.
The argument header=TRUE tells R that the first row of the file should be interpreted as the
names of the columns. Remember, you can enter ?read.table at the console to learn more
about these options.
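For instance, reading tab- and pipe-separated files looks like this (a sketch; the file names ratings.tsv and ratings.psv are hypothetical, not files created earlier):

```r
# a tab-delimited file
ratings <- read.table("ratings.tsv", sep="\t", header=TRUE)

# a pipe-delimited file
ratings <- read.table("ratings.psv", sep="|", header=TRUE)
```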
Reading from files in this comma-separated values format (usually with the .csv file
extension) is so common that R has a more specific function just for it. The preceding data
import expression can be best written simply as:
> favs <- read.csv("favorites.txt")
Now, we have all the data in the file held in a variable of class data.frame. A data frame
can be thought of as a rectangular array of data that you might see in a spreadsheet
application. In this way, a data frame can also be thought of as a matrix; indeed, we can
use matrix-style indexing to extract elements from it. A data frame differs from a matrix,
though, in that a data frame may have columns of differing types. For example, whereas a
matrix would only allow one of these types, the dataset we just loaded contains character
data in its first column, and numeric data in its second column.
Let's check out what we have by using the head() command, which will show us the first
few lines of a data frame:
> head(favs)
               flavor number
1           pistachio      6
2 mint chocolate chip      7
3             vanilla      5
4           chocolate     10
5          strawberry      2
6          neopolitan      4
> class(favs)
[1] "data.frame"
> class(favs$flavor)
[1] "factor"
> class(favs$number)
[1] "numeric"
I lied, OK?! So what?! Technically, flavor is a factor data type, not a character type.
We haven't seen factors yet, but the idea behind them is really simple. Essentially, factors
are codings for categorical variables, which are variables that take on one of a finite
number of categories; think {"high", "medium", and "low"} or {"control",
"experimental"}.
Though factors are extremely useful in statistical modeling in R, the fact that R, by
default, automatically interprets a column from the data read from disk as a type factor if it
contains characters is something that trips up novices and seasoned useRs alike. Because
of this, we will primarily prevent this behavior manually by adding the stringsAsFactors
optional keyword argument to the read.* commands:
> favs <- read.csv("favorites.txt", stringsAsFactors=FALSE)
> class(favs$flavor)
[1] "character"
Much better, for now! If you'd like to make this behavior the new default, read the
?options manual page. We can always convert to factors later on if we need to!
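If we do decide we want a factor later on (say, for statistical modeling), one way, as a sketch, is to convert the column explicitly:

```r
# convert the character column to a factor on demand
favs$flavor <- factor(favs$flavor)
class(favs$flavor)   # now "factor"
levels(favs$flavor)  # the distinct flavor categories
```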
If you haven't noticed already, I've snuck a new operator on you: $, the extract operator.
This is the most popular way to extract attributes (or columns) from a data frame. You can
also use double square brackets ([[ and ]]) to do this.
These are both in addition to the canonical matrix indexing option. The following three
statements are thus, in this context, functionally identical:
> favs$flavor
[1] "pistachio"           "mint chocolate chip" "vanilla"
[4] "chocolate"           "strawberry"          "neopolitan"
> favs[["flavor"]]
[1] "pistachio"           "mint chocolate chip" "vanilla"
[4] "chocolate"           "strawberry"          "neopolitan"
> favs[,1]
[1] "pistachio"           "mint chocolate chip" "vanilla"
[4] "chocolate"           "strawberry"          "neopolitan"
Note
Notice how R has now printed another number in square brackets, besides [1], along
with our output. This is to show us that "chocolate" is the fourth element of the vector that
was returned from the extraction.
You can use the names() function to get a list of the columns available in a data frame.
You can even reassign names using the same:
> names(favs)
[1] "flavor" "number"
> names(favs)[1] <- "flav"
> names(favs)
[1] "flav"   "number"
Lastly, we can get a compact display of the structure of a data frame by using the str()
function on it:
> str(favs)
'data.frame':	6 obs. of  2 variables:
 $ flav  : chr  "pistachio" "mint chocolate chip" "vanilla" "chocolate" ...
 $ number: num  6 7 5 10 2 4
Actually, you can use this function on any R structure; the property of functions that
change their behavior based on the type of input is called polymorphism.
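To see this polymorphism in action, try str() on a few different structures; the shape of the output adapts to the type of the input:

```r
str(c(1, 2, 3))           # a numeric vector
str("hello")              # a character vector of length 1
str(matrix(1:6, ncol=2))  # an integer matrix
```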
Working with packages
Robust, performant, and numerous though base R's functions are, we are by no means
limited to them! Additional functionality is available in the form of packages. In fact, what
makes R such a formidable statistics platform is the astonishing wealth of packages
available (well over 7,000 at the time of writing). R's ecosystem is second to none!
Most of these myriad packages exist on the Comprehensive R Archive Network
(CRAN). CRAN is the primary repository for user-created packages.
One package that we are going to start using right away is the ggplot2 package. ggplot2 is
a plotting system for R. Base R has sophisticated and advanced mechanisms to plot data,
but many find ggplot2 more consistent and easier to use. Further, the plots are often more
aesthetically pleasing by default.
Let's install it!
> # downloads and installs from CRAN
> install.packages("ggplot2")
Now that we have the package downloaded, let's load it into the R session, and test it out
by plotting our data from the last section:
> library(ggplot2)
> ggplot(favs, aes(x=flav, y=number)) +
+   geom_bar(stat="identity") +
+   ggtitle("Soy ice cream flavor preferences")
Figure 1.1: Soy ice cream flavor preferences
You're all wrong; mint chocolate chip is way better!
Don't worry about the syntax of the ggplot function, yet. We'll get to it in good time.
You will be installing some more packages as you work through this text. In the
meantime, if you want to play around with a few more packages, you can install the gdata
and foreign packages, which allow you to import Excel spreadsheets and SPSS data
files, respectively, directly into R.
Exercises
You can practice the following exercises to help you get a good grasp of the concepts
learned in this chapter:
Write a function called simon.says that takes in a character string, and returns that
string in all uppercase after prepending the string "Simon says: " to the beginning of
it.
Write a function that takes two matrices as arguments, and returns a logical value
representing whether the matrices can be matrix multiplied.
Find a free dataset on the web, download it, and load it into R. Explore the structure
of the dataset.
Reflect upon how Hester Prynne allowed her scarlet letter to be decorated with
flowers by her daughter in Chapter 10. To what extent is this indicative of Hester's
recasting of the scarlet letter as a positive part of her identity? Back up your thesis
with excerpts from the book.
Summary
In this chapter, we learned about the world's greatest analytics platform, R. We started
from the beginning and built a foundation, and will now explore R further, based on the
knowledge gained in this chapter. By now, you have become well versed in the basics of R
(which, paradoxically, is the hardest part). You now know how to:
Use R as a big calculator to do arithmetic
Make vectors, operate on them, and subset them expressively
Load data from disk
Install packages
You have by no means finished learning about R; indeed, we have gone over mostly just
the basics. However, we have enough to continue ahead, and you'll pick up more along
the way. Onward to statistics land!
Chapter 2. The Shape of Data
Welcome back! Since we now have enough knowledge about R under our belt, we can
finally move on to applying it. So, join me as we jump out of the R frying pan and into the
statistics fire.
Univariate data
In this chapter, we are going to deal with univariate data, which is a fancy way of saying
samples of one variable: the kind of data that goes into a single R vector. Analysis of
univariate data isn't concerned with the why questions (causes, relationships, or anything
like that); the purpose of univariate analysis is simply to describe.
In univariate data, one variable, let's call it x, can represent categories like soy ice
cream flavors, heads or tails, names of cute classmates, the roll of a die, and so on. In
cases like these, we call x a categorical variable.
> categorical.data <- c("heads", "tails", "tails", "heads")
Categorical data is represented, in the preceding statement, as a vector of character type.
In this particular example, we could further specify that this is a binary or dichotomous
variable, because it only takes on two values, namely, "heads" and "tails."
Our variable x could also represent a number like air temperature, the prices of financial
instruments, and so on. In such cases, we call this a continuous variable.
> contin.data <- c(198.41, 178.46, 165.20, 141.71, 138.77)
Univariate data of a continuous variable is represented, as seen in the preceding statement,
as a vector of numeric type. These data are the stock prices of a hypothetical company that
offers a hypothetical commercial statistics platform inferior to R.
You might come to the conclusion that if a vector contains character types, it is a
categorical variable, and if it contains numeric types, it is a continuous variable. Not quite!
Consider the case of data that contains the results of the roll of a six-sided die. A natural
approach to storing this would be by using a numeric vector. However, this isn't a
continuous variable, because each result can only take on six distinct values: 1, 2, 3, 4, 5,
and 6. This is a discrete numeric variable. Other discrete numeric variables can be the
number of bacteria in a petri dish, or the number of love letters to cute classmates.
The mark of a continuous variable is that it could take on any value between some
theoretical minimum and maximum. The range of values in the case of a die roll has a
minimum of 1 and a maximum of 6, but the result can never be 2.3. Contrast this with, say, the
example of the stock prices, which could be zero, zillions, or anything in between.
On occasion, we are unable to neatly classify non-categorical data as either continuous or
discrete. In some cases, discrete variables may be treated as if there is an underlying
continuum. Additionally, continuous variables can be discretized, as we'll see soon.
Frequency distributions
A common way of describing univariate data is with a frequency distribution. We've
already seen an example of a frequency distribution when we looked at the preferences for
soy ice cream at the end of the last chapter. For each flavor of ice cream (a categorical
variable), it depicted the count, or frequency, of the occurrences in the underlying dataset.
To demonstrate examples of other frequency distributions, we need to find some data.
Fortunately, for the convenience of useRs everywhere, R comes preloaded with almost one
hundred datasets. You can view a full list if you execute help(package="datasets").
There are also hundreds more available from add-on packages.
The first dataset that we are going to use is mtcars: data on the design and performance
of 32 automobiles that was extracted from the 1974 Motor Trend US magazine. (To find
out more information about this dataset, execute ?mtcars.)
Take a look at the first few lines of this dataset using the head function:
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Check out the carb column, which represents the number of carburetors; by now you
should recognize this as a discrete numeric variable, though we can (and will!) treat this as
a categorical variable for now.
Running the carb vector through the unique function yields the distinct values that this
vector contains.
> unique(mtcars$carb)
[1] 4 1 2 3 6 8
We can see that there must be repeats in the carb vector, but how many? An easy way of
performing a frequency tabulation in R is to use the table function:
> table(mtcars$carb)
 1  2  3  4  6  8
 7 10  3 10  1  1
From the result of the preceding function, we can tell that there are 10 cars with 2
carburetors and 10 with 4, and there is one car each with 6 and 8 carburetors. The value
with the most occurrences in a dataset (in this example, the carb column is our whole data
set) is called the mode. In this case, there are two such values, 2 and 4, so this dataset is
bimodal. (There is a package in R, called modeest, to find modes easily.)
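Until we need modeest, a quick sketch of finding modes by hand builds on the table function; this keeps every value tied for the highest count, so it handles bimodal data correctly:

```r
carb.counts <- table(mtcars$carb)

# keep every value whose count equals the maximum count
names(carb.counts)[carb.counts == max(carb.counts)]
# "2" "4"
```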
Frequency distributions are more often depicted as a chart or plot than as a table of
numbers. When the univariate data is categorical, it is commonly represented as a bar
chart, as shown in Figure 2.1.
Figure 2.1: Frequency distribution of number of carburetors in the mtcars dataset
The other dataset that we are going to use to demonstrate a frequency distribution of a
continuous variable is the airquality dataset, which holds the daily air quality
measurements from May to September in NY. Take a look at it using the head and str
functions. The univariate data that we will be using is the Temp column, which contains the
temperature data in degrees Fahrenheit.
It would be useless to take the same approach to frequency tabulation as we did in the case
of the car carburetors. If we did so, we would have a table containing the frequencies for
each of the 40 unique temperatures, and there would be far more if the temperature
wasn't rounded to the nearest degree. Additionally, who cares that there was one
occurrence of 63 degrees and two occurrences of 64? I sure don't! What we do care about
is the approximate temperature.
Our first step towards building a frequency distribution of the temperature data is to bin
the data, which is to say, we divide the range of values of the vector into a series of
smaller intervals. This binning is a method of discretizing a continuous variable. We then
count the number of values that fall into each interval.
Choosing the size of bins to use is tricky. If there are too many bins, we run into the same
problem as we did with the raw data and have an unwieldy number of columns in our
frequency tabulation. If we make too few, however, we lose resolution and may lose
important information. Choosing the right number of bins is more art than science, but
there are certain commonly used heuristics that often produce sensible results.
We can have R construct n equally-spaced bins for us by using the cut function
which, in its simplest use case, takes a vector of data and the number of bins to create:
> cut(airquality$Temp, 9)
We can then feed this result into the table function for a far more manageable frequency
tabulation:
> table(cut(airquality$Temp, 9))
  (56,60.6] (60.6,65.1] (65.1,69.7] (69.7,74.2] (74.2,78.8]
          8          10          14          16          26
(78.8,83.3] (83.3,87.9] (87.9,92.4]   (92.4,97]
         35          22          15           7
Rad!
Remember when we used a bar chart to visualize the frequency distributions of categorical
data? The common method for visualizing the distribution of discretized continuous data
is by using a histogram, as seen in the following image:
Figure 2.2: Daily temperature measurements from May to September in NYC
Central tendency
One very popular question to ask about univariate data is What is the typical value? or
What's the value around which the data are centered? To answer these questions, we have
to measure the central tendency of a set of data.
We've seen one measure of central tendency already: the mode. The mtcars carburetor
data subset was bimodal, with a two- and four-carburetor setup being the most popular. The
mode is the central tendency measure that is applicable to categorical data.
The mode of a discretized continuous distribution is usually considered to be the interval
that contains the highest frequency of data points. This makes it dependent on the method
and parameters of the binning. Finding the mode of data from a non-discretized
continuous distribution is a more complicated procedure, which we'll see later.
Perhaps the most famous and commonly used measure of central tendency is the mean.
The mean is the sum of a set of numerics divided by the number of elements in that set.
This simple concept can also be expressed as a complex-looking equation:
x̄ = (Σᵢ xᵢ) / n
where x̄ (pronounced x bar) is the mean, Σᵢ xᵢ is the summation of the elements in the
dataset, and n is the number of elements in the set. (As an aside, if you are intimidated by
the equations in this book, don't be! None of them are beyond your grasp; just think of
them as sentences of a language you're not proficient in yet.)
The mean is represented as x̄ when we are talking about the mean of a sample (or subset)
of a larger population, and µ when we are talking about the mean of the population. A
population may have too many items to compute the mean directly. When this is the case,
we rely on statistics applied to a sample of the population to estimate its parameters.
Another way to express the preceding equation using R constructs is as follows:
> sum(nums) / length(nums)   # nums would be a vector of numerics
As you might imagine, though, the mean has an eponymous R function that is built-in
already:
> mean(c(1, 2, 3, 4, 5))
[1] 3
The mean is not defined for categorical data; remember that the mode is the only measure of
central tendency that we can use with categorical data.
The mean (occasionally referred to as the arithmetic mean, to contrast with the far less
often used geometric, harmonic, and trimmed means), while extraordinarily popular, is not
a very robust statistic. This is because the statistic is unduly affected by outliers (atypically
distant data points or observations). A paradigmatic example where the robustness of the
mean fails is its application to the different distributions of income.
Imagine the wages of employees in a company called Marx & Engels, Attorneys at Law,
where the typical worker makes $40,000 a year while the CEO makes $500,000 a year. If
we compute the mean of the salaries based on a sample of ten that contains just the
exploited class, we will have a fairly accurate representation of the average salary of a
worker at that company. If, however, by the luck of the draw, our sample contains the
CEO, the mean of the salaries will skyrocket to a value that is no longer representative or
very informative.
More specifically, robust statistics are statistical measures that work well when thrown at a
wide variety of different distributions. The mean works well with one particular type of
distribution, the normal distribution, and, to varying degrees, fails to accurately represent
the central tendency of other distributions.
Figure 2.3: A normal distribution
The normal distribution (also called the Gaussian distribution, if you want to impress
people) is frequently referred to as the bell curve because of its shape. As seen in the
preceding image, the vast majority of the data points lie within a narrow band around the
center of the distribution, which is the mean. As you get further and further from the
mean, the observations become less and less frequent. It is a symmetric distribution,
meaning that the side that is to the right of the mean is a mirror image of the left side of
the mean.
Not only is the usage of the normal distribution extremely common in statistics, but it is
also ubiquitous in real life, where it can model anything from people's heights to test
scores; a few will fare lower than average, and a few fare higher than average, but most
are around average.
The utility of the mean as a measure of central tendency becomes strained as the normal
distribution becomes more and more skewed, or asymmetrical.
If the majority of the data points fall on the left side of the distribution, with the right side
tapering off slower than the left, the distribution is considered positively skewed or
right-tailed. If the longer tail is on the left side and the bulk of the distribution is hanging out to
the right, it is called negatively skewed or left-tailed. This can be seen clearly in the
following images:
Figure 2.4a: A negatively skewed distribution
Figure 2.4b: A positively skewed distribution
Luckily, for cases of skewed distributions, or other distributions which the mean is
inadequate to describe, we can use the median instead.
The median of a dataset is the middle number in the set after it is sorted. Less concretely,
it is the value that cleanly separates the higher-valued half of the data and the lower-valued
half.
The median of the set of numbers {1, 3, 5, 6, 7} is 5. In a set of numbers with an
even number of elements, the mean of the two middle values is taken to be the median.
For example, the median of the set {3, 3, 6, 7, 7, 10} is 6.5. The median is the 50th
percentile, meaning that 50 percent of the observations fall below that value.
> median(c(3, 7, 6, 10, 3, 7))
[1] 6.5
Consider the example of Marx & Engels, Attorneys at Law that we referred to earlier.
Remember that if the sample of employees' salaries included the CEO, it would give our
mean a non-representative value. The median solves our problem beautifully. Let's say our
sample of 10 employees' salaries was {41000, 40300, 38000, 500000, 41500, 37000,
39600, 42000, 39900, 39500}. Given this set, the mean salary is $85,880 but the median is
$40,100, way more in line with the salary expectations of the proletariat at the law firm.
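We can verify this in R using the sample from the text:

```r
salaries <- c(41000, 40300, 38000, 500000, 41500,
              37000, 39600, 42000, 39900, 39500)

mean(salaries)    # 85880: dragged upward by the CEO's salary
median(salaries)  # 40100: representative of the typical worker
```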
In symmetric data, the mean and median are often very close to each other in value, if not
identical. In asymmetric data, this is not the case. It is telling when the median and the
mean are very discrepant. In general, if the median is less than the mean, the dataset has a
large right tail or outliers/anomalies/erroneous data to the right of the distribution. If the
mean is less than the median, it tells the opposite story. The degree of difference between
the mean and the median is often an indication of the degree of skewness.
This property of the median, resistance to the influence of outliers, makes it a robust
statistic. In fact, the median is the most outlier-resistant metric in statistics.
As great as the median is, it's far from being perfect for describing data just on its own. To
see what I mean, check out the three distributions in the following image. All three have
the same mean and median, yet all three are very different distributions.
Figure 2.5: Three distributions with the same mean and median
Clearly, we need to look to other statistical measures to describe these differences.
Note
Before going on to the next section, check out the summary function in R.
Spread
Another very popular question regarding univariate data is, How variable are the data
points? or How spread out or dispersed are the observations? To answer these questions,
we have to measure the spread, or dispersion, of a data sample.
The simplest way to answer that question is to take the smallest value in the dataset and
subtract it from the largest value. This will give you the range. However, this suffers from a
problem similar to the issue of the mean. The range in salaries at the law firm will vary
widely depending on whether the CEO is included in the set. Further, the range is just
dependent on two values, the highest and lowest, and therefore, can't speak of the
dispersion of the bulk of the dataset.
One tactic that solves the first of these problems is to use the interquartile range.
Note
What about measures of spread for categorical data?
The measures of spread that we talk about in this section are only applicable to numeric
data. There are, however, measures of spread or diversity of categorical data. In spite of
the usefulness of these measures, this topic goes unmentioned or blithely ignored in most
data analysis and statistics texts. This is a long and venerable tradition that we will, for the
most part, adhere to in this book. If you are interested in learning more about this, search
for "Diversity Indices" on the web.
Remember when we said that the median split a sorted dataset into two equal parts, and
that it was the 50th percentile because 50 percent of the observations fell below its value?
In a similar way, if you were to divide a sorted dataset into four equal parts, or quartiles,
the three values that make these divides would be the first, second, and third quartiles,
respectively. These values can also be called the 25th, 50th, and 75th percentiles. Note that
the second quartile, the 50th percentile, and the median are all equivalent.
The interquartile range is the difference between the third and first quartiles. If you apply
the interquartile range to a sample of salaries at the law firm that includes the CEO, the
enormous salary will be discarded with the highest 25 percent of the data. However, this
still only uses two values, and doesn't speak to the variability of the middle 50 percent.
Well, one way we can use all the data points to inform our spread metric is by subtracting each element of a data set from the mean of the data set. This will give us the deviations, or residuals, from the mean. If we add up all these deviations, we will arrive at the sum of the deviations from the mean. Try to find the sum of the deviations from the mean in this set: {1, 3, 5, 6, 7}.
If we try to compute this, we notice that the positive deviations are cancelled out by the negative deviations. In order to cope with this, we need to take the absolute value, or the magnitude, of each deviation, and sum them.
This is a great start, but note that this metric keeps increasing if we add more data to the set. Because of this, we may want to take the average of these deviations. This is called the average deviation.
For those having trouble following the description in words, the formula for the average deviation from the mean is the following:

average deviation = (1/N) × Σ |xᵢ − µ|

where µ is the mean, N is the number of elements in the sample, and xᵢ is the ith element of the data set. It can also be expressed in R as follows:

> sum(abs(x - mean(x))) / length(x)

Though average deviation is an excellent measure of spread in its own right, its use is commonly, and sometimes unfortunately, supplanted by two other measures.
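To make the averaging concrete, here is the small set from above worked through in R (a quick sketch; the numbers are easy to verify by hand):

```r
x <- c(1, 3, 5, 6, 7)
mean(x)                             # 4.4
x - mean(x)                         # deviations: -3.4 -1.4 0.6 1.6 2.6
sum(x - mean(x))                    # effectively 0: the deviations cancel out
sum(abs(x - mean(x))) / length(x)   # average deviation: 1.92
```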
Instead of taking the absolute value of each residual, we can achieve a similar outcome by squaring each deviation from the mean. This, too, ensures that each residual is positive (so that there is no cancelling out). Additionally, squaring the residuals has the sometimes desirable property of magnifying larger deviations from the mean, while being more forgiving of smaller deviations. The sum of the squared deviations is called (you guessed it!) the sum of squared deviations from the mean or, simply, the sum of squares. The average of the sum of squared deviations from the mean is known as the variance and is denoted by σ²:

σ² = (1/N) × Σ (xᵢ − µ)²

When we square each deviation, we also square our units. For example, if our data set held measurements in meters, our variance would be expressed in terms of meters squared. To get back our original units, we have to take the square root of the variance:

σ = √( (1/N) × Σ (xᵢ − µ)² )

This new measure, denoted by σ, is the standard deviation, and it is one of the most important measures in this book.
Note that we switched from referring to the mean as x̄ to referring to it as µ. This was not a mistake. Remember that x̄ was the sample mean, and µ represented the population mean. The preceding equations use µ to illustrate that these equations are computing the spread metrics on the population data set, and not on a sample. If we want to describe the variance and standard deviation of a sample, we use the symbols s² and s instead of σ² and σ respectively, and our equations change slightly:

s² = (1/(n − 1)) × Σ (xᵢ − x̄)²

Instead of dividing our sum of squares by the number of elements in the set, we are now dividing it by n - 1. What gives?
To answer that question, we have to learn a little bit about populations, samples, and estimation.
Populations, samples, and estimation
One of the core ideas of statistics is that we can use a subset of a group, study it, and then make inferences or conclusions about that much larger group.
For example, let's say we wanted to find the average (mean) weight of all the people in Germany. One way to do this is to visit all the 81 million people in Germany, record their weights, and then find the average. However, it is a far more sane endeavor to take down the weights of only a few hundred Germans, and use those to deduce the average weight of all Germans. In this case, the few hundred people we do measure is the sample, and the entirety of people in Germany is called the population.
Now, there are Germans of all shapes and sizes: some heavier, some lighter. If we only pick a few Germans to weigh, we run the risk of, by chance, choosing a group of primarily underweight Germans or overweight ones. We might then come to an inaccurate conclusion about the weight of all Germans. But, as we add more Germans to our sample, those chance variations tend to balance themselves out.
All things being equal, it would be preferable to measure the weights of all Germans so that we can be absolutely sure that we have the right answer, but that just isn't feasible. If we take a large enough sample, though, and are careful that our sample is well representative of the population, not only can we get extraordinarily close to the actual average weight of the population, but we can quantify our uncertainty. The more Germans we include in our sample, the less uncertain we are about our estimate of the population.
In the preceding case, we are using the sample mean as an estimator of the population mean, and the actual value of the sample mean is called our estimate. It turns out that the formula for the population mean is a great estimator of the mean of the population when applied to only a sample. This is why we make no distinction between the population and sample means, except to replace the µ with x̄. Unfortunately, there exists no perfect estimator for the standard deviation of a population for all population types. There will always be some systematic difference between the expected value of the estimator and the real value of the population. This means that there is some bias in the estimator. Fortunately, we can partially correct it.
Note that the two differences between the population and the sample standard deviation are that (a) the µ is replaced by x̄ in the sample standard deviation, and (b) the divisor n is replaced by n - 1.
In the case of the standard deviation of the population, we know the mean µ. In the case of the sample, however, we don't know the population mean; we only have an estimate of the population mean based on the sample mean x̄. This must be taken into account and corrected in the new equation. No longer can we divide by the number of elements in the data set; we have to divide by the degrees of freedom, which is n - 1.
Note
What in the world are degrees of freedom? And why is it n - 1?
Let's say we were gathering a party of six to play a board game. In this board game, each player controls one of six colored pawns. People start to join in at the board. The first person at the board gets their pick of their favorite colored pawn. The second player has one less pawn to choose from, but she still has a choice in the matter. By the time the last person joins in at the game table, she doesn't have a choice in what pawn she uses; she is forced to use the last remaining pawn. The concept of degrees of freedom is a little like this.
If we have a group of five numbers, but hold the mean of those numbers fixed, all but the last number can vary, because the last number must take on the value that will satisfy the fixed mean. We only have four degrees of freedom in this case.
More generally, the degrees of freedom is the sample size minus the number of parameters estimated from the data. When we use the mean estimate in the standard deviation formula, we are effectively keeping one of the parameters of the formula fixed, so that only n - 1 observations are free to vary. This is why the divisor of the sample standard deviation formula is n - 1; it is the degrees of freedom that we are dividing by, not the sample size.
If you thought that the last few paragraphs were heady and theoretical, you're right. If you are confused, particularly by the concept of degrees of freedom, you can take solace in the fact that you are not alone; degrees of freedom, bias, and the subtleties of population vs. sample standard deviation are notoriously confusing topics for newcomers to statistics. But you only have to learn it once!
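A quick sketch in R makes the population-versus-sample distinction concrete; note that R's built-in var and sd functions use the n - 1 (sample) versions:

```r
x <- c(1, 3, 5, 6, 7)
n <- length(x)
ss <- sum((x - mean(x))^2)  # sum of squares: 23.2

ss / n        # population variance: 4.64
ss / (n - 1)  # sample variance: 5.8

var(x)        # 5.8 -- R divides by the degrees of freedom, n - 1
sd(x)         # sqrt(5.8), about 2.408
```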
Probability distributions
Up until this point, when we spoke of distributions, we were referring to frequency distributions. However, when we talk about distributions later in the book, or when other data analysts refer to them, we will be talking about probability distributions, which are much more general.
It's easy to turn a categorical, discrete, or discretized frequency distribution into a probability distribution. As an example, refer to the frequency distribution of carburetors in the first image in this chapter. Instead of asking What number of cars have n number of carburetors?, we can ask, What is the probability that, if I choose a car at random, I will get a car with n carburetors?
We will talk more about probability (and different interpretations of probability) in Chapter 4, Probability, but for now, probability is a value between 0 and 1 (or 0 percent and 100 percent) that measures how likely an event is to occur. To answer the question What's the probability that I will pick a car with 4 carburetors?, the equation is:

P(4 carburetors) = (number of cars with 4 carburetors) / (total number of cars) = 10/32 = 0.3125

You can find the probability of picking a car of any one particular number of carburetors as follows:
> table(mtcars$carb) / length(mtcars$carb)

      1       2       3       4       6       8 
0.21875 0.31250 0.09375 0.31250 0.03125 0.03125 

Instead of making a bar chart of the frequencies, we can make a bar chart of the probabilities. This is called a probability mass function (PMF). It looks the same, but now it maps from carburetors to probabilities, not frequencies. Figure 2.6a represents this.
And, just as it is with the bar chart, we can easily tell that 2 and 4 are the numbers of carburetors most likely to be chosen at random.
We could do the same with discretized numeric variables as well. The following images are a representation of the temperature histogram as a probability mass function.
Figure 2.6a: Probability mass function of number of carburetors
Figure 2.6b: Probability mass function of daily temperature measurements from May to September in NY
Note that this PMF only describes the temperatures of NYC in the data we have.
There's a problem here, though: this PMF is completely dependent on the size of the bins (our method of discretizing the temperatures). Imagine that we constructed the bins such that each bin held only one temperature within a degree. In this case, we wouldn't be able to tell very much from the PMF at all, since each specific degree only occurs a few times, if any, in the data set. The same problem, but worse, happens when we try to describe continuous variables with probabilities without discretizing them at all. Imagine trying to visualize the probability (or the frequency) of the temperatures if they were measured to the thousandth place (for example, {90.167, 67.361, ..}). There would be no visible bars at all!
What we need here is a probability density function (PDF). A probability density function will tell us the relative likelihood that we will experience a certain temperature. The next image shows a PDF that fits the temperature data that we've been playing with; it is analogous to, but better than, the histogram we saw in the beginning of the chapter and the PMF in the preceding figure.
The first thing you'll notice about this new plot is that it is smooth, not jagged or boxy like the histogram and PMFs. This should intuitively make more sense, because temperatures are a continuous variable, and there are likely to be no sharp cutoffs in the probability of experiencing temperatures from one degree to the next.
Figure 2.7: Probability density function of the temperature data
The second thing you should notice is that the units and the values on the y axis have changed. The y axis no longer represents probabilities; it now represents probability densities. Though it may be tempting, you can't look at this function and answer the question What is the probability that it will be exactly 80 degrees?. Technically, the probability of it being 80.0000 exactly is microscopically small, almost zero. But that's okay! Remember, we don't care what the probability of experiencing a temperature of 80.0000 is; we just care about the probability of a temperature around there.
We can answer the question What's the probability that the temperature will be within a particular range?. The probability of experiencing a temperature of, say, 80 to 90 degrees, is the area under the curve from 80 to 90. Those of you unfortunate readers who know calculus will recognize this as the integral, or anti-derivative, of the PDF evaluated over the range:

∫ from 80 to 90 of f(x) dx

where f(x) is the probability density function.
The next image shows the area under the curve for this range in pink. You can immediately see that the region covers a lot of area, perhaps one third. According to R, it's about 34 percent:

> temp.density <- density(airquality$Temp)
> pdf <- approxfun(temp.density$x, temp.density$y, rule=2)
> integrate(pdf, 80, 90)
0.3422287 with absolute error < 7.5e-06

Figure 2.8: PDF with highlighted interval
Wedon’tgetaprobabilitydensityfunctionfromthesampleforfree.ThePDFhastobe
estimated.ThePDFisn’tsomuchtryingtoconveytheinformationaboutthesamplewe
haveasattemptingtomodeltheunderlyingdistributionthatgaverisetothatsample.
Todothis,weuseamethodcalledkerneldensityestimation.Thespecificsofkernel
densityestimationarebeyondthescopeofthisbook,butyoushouldknowthatthedensity
estimationisheavilygovernedbyaparameterthatcontrolsthesmoothnessofthe
estimation.Thisiscalledthebandwidth.
Howdowechoosethebandwidth?Well,it’sjustlikechoosingthesizetomakethebinsin
ahistogram:there’snorightanswer.It’sabalancingactbetweenreducingchanceornoise
inthemodelandnotlosingimportantinformationbysmoothingoverpertinent
characteristicsofthedata.Thisisatradeoffwewillseetimeandtimeagainthroughout
thistext.
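One hedged way to see the bandwidth tradeoff for yourself (and to reproduce something like Figure 2.9) is base R's density function, whose adjust argument scales the automatically chosen bandwidth:

```r
d <- airquality$Temp
# adjust scales the default bandwidth:
# below 1 is squigglier (undersmoothed), above 1 is smoother
plot(density(d, adjust=0.3), main="undersmoothed")
plot(density(d, adjust=1),   main="default bandwidth")
plot(density(d, adjust=3),   main="oversmoothed")
```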
Anyway, the great thing about PDFs is that you don't have to know calculus to interpret them. Not only are PDFs a useful tool analytically, but they make for a top-notch visualization of the shape of data.
Note
By the way…
Remember when we were talking about modes, and I said that finding the mode of non-discretized, continuously distributed data is a more complicated procedure than for discretized or categorical data? The mode for these types of univariate data is the peak of the PDF. So, in the temperature example, the mode is around 80 degrees.
Figure 2.9: Three different bandwidths used on the same data
Visualization methods
In an earlier image, we saw three very different distributions, all with the same mean and median. I said then that we need to quantify variance to tell them apart. In the following image, there are three very different distributions, all with the same mean, median, and variance.
Figure 2.10: Three PDFs with the same mean, median, and standard deviation
If you just rely on basic summary statistics to understand univariate data, you'll never get the full picture. It's only when we visualize the data that we can clearly see, at a glance, whether there are any clusters or areas with a high density of data points, how many clusters there are, whether there are outliers, whether there is a pattern to the outliers, and so on. When dealing with univariate data, the shape is the most important part (that's why this chapter is called The Shape of Data!).
We will be using ggplot2's qplot function to investigate these shapes and visualize these data. qplot (for quick plot) is the simpler cousin of the more expressive ggplot function. qplot makes it easy to produce handsome and compelling graphics using a consistent grammar. Additionally, many of the skills, lessons, and know-how from qplot are transferable to ggplot (for when we have to get more advanced).
Note
What's ggplot2? Why are we using it?
There are a few plotting mechanisms for R, including the default one that comes with R (called base R). However, ggplot2 seems to be a lot of people's favorite. This is not unwarranted, given its wide use, excellent documentation, and consistent grammar.
Since the base R graphics subsystem is what I learned to wield first, I've become adept at using it. There are certain types of plots that I produce faster using base R, so I still use it on a regular basis (Figure 2.8 to Figure 2.10 were made using base R!).
Though we will be using ggplot2 for this book, feel free to go your own way when making your very own plots.
Most of the graphics in this section are going to take the following form:

> qplot(column, data=dataframe, geom=...)

where column is a particular column of the data frame dataframe, and the geom keyword argument specifies a geometric object; it will control the type of plot that we want. For visualizing univariate data, we don't have many options for geom. The three types that we will be using are bar, histogram, and density. Making a bar graph of the frequency distribution of the number of carburetors couldn't be easier:

> library(ggplot2)
> qplot(factor(carb), data=mtcars, geom="bar")

Figure 2.11: Frequency distribution of the number of carburetors
Using the factor function on the carb column makes the plot look better in this case.
We could, if we wanted to, make an unattractive and distracting plot by coloring all the bars a different color, as follows:

> qplot(factor(carb),
+       data=mtcars,
+       geom="bar",
+       fill=factor(carb),
+       xlab="number of carburetors")

Figure 2.12: With color and label modification
We also relabeled the x axis (which is automatically set by qplot) with more informative text.
It’sjustaseasytomakeahistogramofthetemperaturedata—themaindifferenceisthat
weswitchgeomfrombartohistogram:
>qplot(Temp,data=airquality,geom="histogram")
Figure2.13:Histogramoftemperaturedata
Whydoesn’titlooklikethefirsthistograminthebeginningofthechapter,youask?Well,
that’sbecauseoftworeasons:
Iadjustedthebinwidth(sizeofthebins)
Iaddedcolortotheoutlineofthebars
ThecodeIusedforthefirsthistogramlookedasfollows:
>qplot(Temp,data=airquality,geom="histogram",
+binwidth=5,color=I("white"))
MakingplotsoftheapproximationofthePDFaresimilarlysimple:
>qplot(Temp,data=airquality,geom="density")
Figure2.14:PDFoftemperaturedata
Byitself,Ithinktheprecedingplotisratherunattractive.Wecangiveitalittlemoreflair
by:
Fillingthecurvepink
Addingalittletransparencytothefill
>qplot(Temp,data=airquality,geom="density",
+adjust=.5,#changesbandwidth
+fill=I("pink"),
+alpha=I(.5),#addstransparency
+main="densityplotoftemperaturedata")
Figure2.15:Figure2.14withmodifications
Nowthat’sahandsomeplot!
Noticethatwealsomadethebandwidthsmallerthanthedefault(1,whichmadethePDF
moresquiggly)andaddedatitletotheplotwiththemainfunction.
Exercises
Here are a few exercises for you to revise the concepts learned in this chapter:
Write an R function to compute the interquartile range.
Learn about winsorized, geometric, harmonic, and trimmed means. To what extent do these metrics solve the problem of the non-robustness of the arithmetic mean?
Craft an assessment of Virginia Woolf's impact on feminine discourse in the 20th century. Be sure to address both prosaic and lyrical forms in your response.
Summary
One of the hardest things about data analysis is statistics, and one of the hardest things about statistics (not unlike computer programming) is that the beginning is the toughest hurdle, because the concepts are so new and unfamiliar. As a result, some might find this to be one of the more challenging chapters in this text.
However, hard work during this phase pays enormous dividends; it provides a sturdy foundation on which to pile on and organize new knowledge.
To recap, in this chapter, we learned about univariate data. We also learned about:
The types of univariate data
How to measure the central tendency of these data
How to measure the spread of these data
How to visualize the shape of these data
Along the way, we also learned a little bit about probability distributions and population/sample statistics.
I'm glad you made it through! Relax, make yourself a mocktail, and I'll see you at Chapter 3, Describing Relationships shortly!
Chapter 3. Describing Relationships
Is there a relationship between smoking and lung cancer? Do people who care for dogs live longer? Is your university's admissions department sexist?
Tackling these exciting questions is only possible when we take a step beyond simply describing univariate data sets, one step beyond!
Multivariate data
In this chapter, we are going to describe relationships, and begin working with multivariate data, which is a fancy way of saying samples containing more than one variable.
The troublemaker reader might remark that all the data sets that we've worked with thus far (mtcars and airquality) have contained more than one variable. This is technically true, but only technically. The fact of the matter is that we've only been working with one of the data set's variables at any one time. Note that multivariate analytics is not the same as doing univariate analytics on more than one variable; multivariate analyses and describing relationships involve several variables at the same time.
To put this more concretely, in the last chapter we described the shape of, say, the temperature readings in the airquality data set.

> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

In this chapter, we will be exploring whether there is a relationship between temperature and the month in which the temperature was taken (spoiler alert: there is!).
The kind of multivariate analysis you perform is heavily influenced by the type of data that you are working with. There are three broad classes of bivariate (or two-variable) relationships:
The relationship between one categorical variable and one continuous variable
The relationship between two categorical variables
The relationship between two continuous variables
We will get into all of these in the next three sections. In the section after that, we will touch on describing the relationships between more than two variables. Finally, following in the tradition of the previous chapter, we will end with a section on how to create your own plots to capture the relationships that we'll be exploring.
Relationships between a categorical and a continuous variable
Describing the relationship between categorical and continuous variables is perhaps the most familiar of the three broad categories.
When I was in the fifth grade, my class had to participate in an area-wide science fair. We were to devise our own experiment, perform it, and then present it. For some reason, in my experiment I chose to water some lentil sprouts with tap water and some with alcohol to see if they grew differently.
When I measured the heights and compared the measurements of the teetotaller lentils versus the drunken lentils, I was pointing out a relationship between a categorical variable (alcohol/no-alcohol) and a continuous variable (the heights of the seedlings).
Note
Note that I wasn't trying to make a broader statement about how alcohol affects plant growth. In the grade-school experiment, I was just summarizing the differences in the heights of those plants, the ones that were in the experiment. In order to make statements or draw conclusions about how alcohol affects plant growth in general, we would be exiting the realm of exploratory data analysis and entering the domain of inferential statistics, which we will discuss in the next unit.
The alcohol could have made the lentils grow faster (it didn't), grow slower (it did), or grow at the same rate as the tap water lentils. All three of these possibilities constitute a relationship: greater than, less than, or equal to.
wewillbeusingtheirisdatasetthatisconvenientlybuiltrightintoR.
>head(iris)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
44.63.11.50.2setosa
55.03.61.40.2setosa
65.43.91.70.4setosa
Thisisafamousdatasetandisusedtodayprimarilyforteachingpurposes.Itgivesthe
lengthsandwidthsofthepetalsandsepals(anotherpartoftheflower)of150Irisflowers.
Ofthe150flowers,ithas50measurementseachfromthreedifferentspeciesofIris
flowers:setosa,versicolor,andvirginica.
By now, we know how to take the mean of all the petal lengths:

> mean(iris$Petal.Length)
[1] 3.758

But we could also take the mean of the petal lengths of each of the three species to see if there is any difference in the means.
Naively, one might approach this task in R as follows:

> mean(iris$Petal.Length[iris$Species=="setosa"])
[1] 1.462
> mean(iris$Petal.Length[iris$Species=="versicolor"])
[1] 4.26
> mean(iris$Petal.Length[iris$Species=="virginica"])
[1] 5.552

But, as you might imagine, there is a far easier way to do this:

> by(iris$Petal.Length, iris$Species, mean)
iris$Species: setosa
[1] 1.462
--------------------------------------------
iris$Species: versicolor
[1] 4.26
--------------------------------------------
iris$Species: virginica
[1] 5.552

by is a handy function that splits the data into subsets and applies a function to each subset. In this case, the Petal.Length vector is divided into three subsets, one for each species, and then the mean function is called on each of those subsets. It appears as if the setosas in this sample have way shorter petals than the other two species, with the virginica samples' petal length beating out versicolor's by a smaller margin.
Although means are probably the most common statistic to be compared between categories, they are not the only statistic we can use to compare. If we had reason to believe that the virginicas have a more widely varying petal length than the other two species, we could pass the sd function to the by function as follows:

> by(iris$Petal.Length, iris$Species, sd)
Most often, though, we want to be able to compare many statistics between groups at one time. To this end, it's very common to pass in the summary function:

> by(iris$Petal.Length, iris$Species, summary)
iris$Species: setosa
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.400   1.500   1.462   1.575   1.900 
--------------------------------------------
iris$Species: versicolor
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3.00    4.00    4.35    4.26    4.60    5.10 
--------------------------------------------
iris$Species: virginica
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.500   5.100   5.550   5.552   5.875   6.900 
As common as this idiom is, it still presents us with a lot of dense information that is difficult to make sense of at a glance. It is more common still to visualize the differences in continuous variables between categories using a box-and-whisker plot:
Figure 3.1: A box-and-whisker plot depicting the relationship between the petal lengths of the different iris species in the iris data set
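A plot like Figure 3.1 can be produced with qplot by supplying both an x (the categorical variable) and a y (the continuous one); a minimal sketch:

```r
library(ggplot2)
# one box of Petal.Length per species
qplot(Species, Petal.Length, data=iris, geom="boxplot")
```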
A box-and-whisker plot (or simply, a box plot if you have places to go, and you're in a rush) displays a stunningly large amount of information in a single chart. Each categorical variable has its own box and whiskers. The bottom and top ends of the box represent the first and third quartiles respectively, and the black band inside the box is the median for that group, as shown in the following figure:
Figure 3.2: The anatomy of a box plot
Depending on whom you talk to and what you use to produce your plots, the edges of the whiskers can mean a few different things. In my favorite variation (called Tukey's variation), the bottom of the whiskers extends to the lowest datum within 1.5 times the interquartile range below the bottom of the box. Similarly, the very top of the whisker represents the highest datum within 1.5 interquartile ranges above the third quartile (remember: the interquartile range is the third quartile minus the first). This is, coincidentally, the variation that ggplot2 uses.
The great thing about box plots is that not only do we get a great sense of the central tendency and dispersion of the distribution within a category, but we can also immediately spot the important differences between each category.
From the box plot in the previous image, it's easy to tell what we already know about the central tendency of the petal lengths between species: that the setosas in this sample have the shortest petals; that the virginicas have the longest on average; and that the versicolors are in the middle, but are closer to the virginicas.
In addition, we can see that the setosas have the thinnest dispersion, and that the virginicas have the highest, when you disregard the outlier.
Before we move on to the next broad category of relationships, let's look at the airquality data set, treat the month as the categorical variable and the temperature as the continuous variable, and see if there is a relationship between the month and the average temperature.

> by(airquality$Temp, airquality$Month, mean)
airquality$Month: 5
[1] 65.54839
--------------------------------------------
airquality$Month: 6
[1] 79.1
--------------------------------------------
airquality$Month: 7
[1] 83.90323
--------------------------------------------
airquality$Month: 8
[1] 83.96774
--------------------------------------------
airquality$Month: 9
[1] 76.9

This is precisely what we would expect from a city in the Northern hemisphere:
Figure 3.3: A box plot of NYC temperatures across months (May to September)
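The box plot in Figure 3.3 can be sketched with qplot as well; the one wrinkle is that Month is stored as a number, so it has to be wrapped in factor to get one box per month:

```r
library(ggplot2)
# Temp (continuous) by Month (treated as categorical)
qplot(factor(Month), Temp, data=airquality, geom="boxplot")
```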
Relationships between two categorical variables
Describing the relationships between two categorical variables is done somewhat less often than the other two broad types of bivariate analyses, but it is just as fun (and useful)!
To explore this technique, we will be using the data set UCBAdmissions, which contains the data on graduate school applicants to the University of California, Berkeley in 1973.
Before we get started, we have to wrap the data set in a call to data.frame to coerce it into a data frame type variable; I'll explain why soon.

> ucba <- data.frame(UCBAdmissions)
> head(ucba)
     Admit Gender Dept Freq
1 Admitted   Male    A  512
2 Rejected   Male    A  313
3 Admitted Female    A   89
4 Rejected Female    A   19
5 Admitted   Male    B  353
6 Rejected   Male    B  207

Now, what we want is a count of the frequencies of the number of students in each of the following four categories:
Accepted female
Rejected female
Accepted male
Rejected male
Do you remember the frequency tabulation at the beginning of the last chapter? This is similar, except that now we are dividing the set by one more variable. This is known as cross-tabulation, or crosstab. It is also sometimes referred to as a contingency table. The reason we had to coerce UCBAdmissions into a data frame is because it was already in the form of a cross-tabulation (except that it further broke the data down into the different departments of the grad school). Check it out by typing UCBAdmissions at the prompt.
We can use the xtabs function in R to make our own cross-tabulations:

# the first argument to xtabs (the formula) should
# be read as: frequency *by* Gender and Admission
> cross <- xtabs(Freq ~ Gender + Admit, data=ucba)
> cross
        Admit
Gender   Admitted Rejected
  Male       1198     1493
  Female      557     1278

Here, at a glance, we can see that there were 1198 males who were admitted, 557 females who were admitted, and so on.
IsthereagenderbiasinUCB’sgraduateadmissionsprocess?Perhaps,butit’shardtotell
fromjustlookingatthe2x2contingencytable.Sure,therearefewerfemalesaccepted
thanmales,buttherearealso,unfortunately,farfewerfemalesthatappliedtoUCBinthe
firstplace.
ToaidusineitherimplicatingUCBofasexistadmissionsmachineorexoneratingthem,it
wouldhelptolookataproportionstable.Usingaproportionstable,wecaneasily
comparetheproportionofthetotalnumberofmaleswhowereacceptedversusthe
proportionofthetotalnumberoffemaleswhowereaccepted.Iftheproportionsaremore
orlessequal,wecanconcludethatgenderdoesnotconstituteafactorinUCB’s
admissionsprocess.Ifthisisthecase,genderandadmissionstatusissaidtobe
conditionallyindependent.
> prop.table(cross, 1)
        Admit
Gender    Admitted  Rejected
  Male   0.4451877 0.5548123
  Female 0.3035422 0.6964578

Note
Why did we supply 1 as an argument to prop.table? Look up the documentation at the R prompt. When would we want to use prop.table(cross, 2)?
Here, we can see that while 45 percent of the males who applied were accepted, only 30 percent of the females who applied were accepted. This is evidence that the admissions department is sexist, right? Not so fast, my friend!
This is precisely what a lawsuit lodged against UCB purported. When the issue was looked into further, it was discovered that, at the department level, women and men actually had similar admissions rates. In fact, some of the departments appeared to have a small but significant bias in favor of women. Check out department A's proportions table, for example:

> cross2 <- xtabs(Freq ~ Gender + Admit, data=ucba[ucba$Dept=="A",])
> prop.table(cross2, 1)
        Admit
Gender    Admitted  Rejected
  Male   0.6206061 0.3793939
  Female 0.8240741 0.1759259

If there were any bias in admissions, these data didn't prove it. This phenomenon, where a trend that appears in combined groups of data disappears or reverses when the data is broken down into groups, is known as Simpson's Paradox. In this case, it was caused by the fact that women tended to apply to departments that were far more selective.
This is probably the most famous case of Simpson's Paradox, and it is also why this data set is built into R. The lesson here is to be careful when using pooled data, and look out for hidden variables.
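One hedged way to check the department-level story yourself is to loop over the departments and print each one's admission proportions (a sketch, reusing the ucba data frame from above):

```r
ucba <- data.frame(UCBAdmissions)
# admission proportions by gender, one department at a time
for (dept in levels(ucba$Dept)) {
  cat("Department", dept, "\n")
  print(prop.table(xtabs(Freq ~ Gender + Admit,
                         data=ucba[ucba$Dept == dept, ]), 1))
  cat("\n")
}
```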
The relationship between two continuous variables
Do you think that there is a relationship between women's heights and their weights? If you said yes, congratulations, you're right!
We can verify this assertion by using the data in R's built-in data set, women, which holds the height and weight of 15 American women from ages 30 to 39.

> head(women)
  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129
> nrow(women)
[1] 15
Specifically, this relationship is referred to as a positive relationship, because as one of the variables increases, we expect an increase in the other variable.
The most typical visual representation of the relationship between two continuous variables is a scatterplot.
A scatterplot is displayed as a group of points whose position along the x axis is established by one variable, and whose position along the y axis is established by the other. When there is a positive relationship, the dots, for the most part, start in the lower-left corner and extend to the upper-right corner, as shown in the following figure. When there is a negative relationship, the dots start in the upper-left corner and extend to the lower-right one. When there is no relationship, it will look as if the dots are all over the place.
Figure 3.4: Scatterplot of women's heights and weights
The more the dots look like they form a straight line, the stronger the relationship between the two continuous variables is said to be; the more diffuse the points, the weaker the relationship. The dots in the preceding figure look almost exactly like a straight line; this is pretty much as strong a relationship as they come.
These kinds of relationships are colloquially referred to as correlations.
Covariance
As always, visualizations are great, necessary even, but on most occasions we are going to quantify these correlations and summarize them with numbers.
The simplest measure of correlation that is widely used is the covariance. For each pair of values from the two variables, the differences from their respective means are taken. Then, those values are multiplied. If both are positive (that is, both values are above their respective means), then the product will be positive too. If both values are below their respective means, the product is still positive, because the product of two negative numbers is positive. Only when one of the values is above its mean and the other is below will the product be negative.

cov(X, Y) = (1/(n − 1)) × Σ (xᵢ − x̄)(yᵢ − ȳ)

Remember, in sample statistics we divide by the degrees of freedom and not the sample size. Note that this means that the covariance is only defined for two vectors that have the same length.
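The recipe above can be written out by hand and checked against R's cov function (a sketch using the women data set, which appears again in the next code listing):

```r
x <- women$height
y <- women$weight
# average product of the paired deviations, divided by n - 1
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)  # 69
cov(x, y)                                             # also 69
```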
We can find the covariance between two variables in R using the cov function. Let's find the covariance between the heights and weights in the women data set:

> cov(women$weight, women$height)
[1] 69
# the order we put the two columns in
# the arguments doesn't matter
> cov(women$height, women$weight)
[1] 69

The covariance is positive, which denotes a positive relationship between the two variables.
The covariance, by itself, is difficult to interpret. It is especially difficult to interpret in this case, because the measurements use different scales: inches and pounds. It is also heavily dependent on the variability in each variable.
Consider what happens when we take the covariance of the weights in pounds and the heights in centimeters:

# there are 2.54 centimeters in each inch
# changing the units to centimeters increases
# the variability within the height variable
> cov(women$height*2.54, women$weight)
[1] 175.26

Semantically speaking, the relationship hasn't changed, so why should the covariance?
Correlation coefficients
A solution to this quirk of covariance is to use Pearson's correlation coefficient instead. Outside its colloquial context, when the word correlation is uttered, especially by analysts, statisticians, or scientists, it usually refers to Pearson's correlation.
Pearson's correlation coefficient is different from covariance in that instead of using the sum of the products of the deviations from the mean in the numerator, it uses the sum of the products of the number of standard deviations away from the mean. These numbers-of-standard-deviations-from-the-mean are called z-scores. If a value has a z-score of 1.5, it is 1.5 standard deviations above the mean; if a value has a z-score of -2, then it is 2 standard deviations below the mean.
Pearson's correlation coefficient is usually denoted by r, and its equation is given as follows:

r = cov(X, Y) / (s_X × s_Y)

which is the covariance divided by the product of the two variables' standard deviations.
An important consequence of using standardized z-scores instead of the magnitude of distance from the mean is that changing the variability in one variable does not change the correlation coefficient. Now you can meaningfully compare values using two different scales or even two different distributions. The correlation between weight/height-in-inches and weight/height-in-centimeters will now be identical, because multiplication by 2.54 does not change the z-scores of each height.
>cor(women$height,women$weight)
[1]0.9954948
>cor(women$height*2.54,women$weight)
[1]0.9954948
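We can verify that description of r for ourselves; this sketch divides the covariance by the product of the standard deviations, and also takes the covariance of the z-scores directly:

```r
# Pearson's r computed by hand: the covariance rescaled by the
# two standard deviations
r <- cov(women$height, women$weight) /
  (sd(women$height) * sd(women$weight))
r   # 0.9954948, identical to cor(women$height, women$weight)

# equivalently, the covariance of the z-scores is r itself
z.height <- as.vector(scale(women$height))
z.weight <- as.vector(scale(women$weight))
cov(z.height, z.weight)
```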
Another important and helpful consequence of this standardization is that the measure of correlation will always range from -1 to 1. A Pearson correlation coefficient of 1 will denote a perfectly positive (linear) relationship, an r of -1 will denote a perfectly negative (linear) relationship, and an r of 0 will denote no (linear) relationship.
Why the linear qualification in parentheses, though?
Intuitively, the correlation coefficient shows how well two variables are described by the straight line that fits the data most closely; this is called a regression or trend line. If there is a strong relationship between two variables, but the relationship is not linear, it cannot be represented accurately by Pearson's r. For example, the correlation between the sequences 1 to 100 and 101 to 200 is 1 (because it is perfectly linear), but a cubic relationship is not:

> xs <- 1:100
> cor(xs, xs+100)
[1] 1
> cor(xs, xs^3)
[1] 0.917552

It is still about 0.92, which is an extremely strong correlation, but not the 1 that you should expect from a perfect correlation.
So Pearson's r assumes a linear relationship between two variables. There are, however, other correlation coefficients that are more tolerant of non-linear relationships. Probably the most common of these is Spearman's rank coefficient, also called Spearman's rho. Spearman's rho is calculated by taking the Pearson correlation not of the values, but of their ranks.
Note
What's a rank?
When you assign ranks to a vector of numbers, the lowest number gets 1, the second lowest gets 2, and so on. The highest datum in the vector gets a rank that is equal to the number of elements in that vector.
In rankings, the magnitude of the difference in values of the elements is disregarded. Consider a race to a finish line involving three cars. Let's say that the winner in the first place finished at a speed three times that of the car in the second place, and the car in the second place beat the car in the third place by only a few seconds. The driver of the car that came first has a good reason to be proud of herself, but her rank, 1st place, does not say anything about how she effectively cleaned the floor with the other two candidates.
Try using R's rank function on the vector c(8, 6, 7, 5, 3, 0, 9). Now try it on the vector c(8, 6, 7, 5, 3, -100, 99999). The rankings are the same, right?
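Here is that experiment, for the curious:

```r
# Ranks ignore magnitude: swapping the smallest and largest
# values for wildly different numbers leaves the ranks intact
rank(c(8, 6, 7, 5, 3, 0, 9))
# 6 4 5 3 2 1 7
rank(c(8, 6, 7, 5, 3, -100, 99999))
# 6 4 5 3 2 1 7
```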
When we use ranks instead, the pair that has the highest value on both the x and the y axes will have the highest rank on both, even if one variable is a non-linear function (cubed, squared, logarithmic, and so on) of the other. The correlations that we just tested will both have Spearman rhos of 1, because cubing a value will not change its rank.

> xs <- 1:100
> cor(xs, xs+100, method="spearman")
[1] 1
> cor(xs, xs^3, method="spearman")
[1] 1

Figure 3.5: Scatterplot of y = x + 100 with regression line. r and rho are both 1
Figure 3.6: Scatterplot of y = x^3 with regression line. r is .92, but rho is 1

Let's use what we've learned so far to investigate the correlation between the weight of a car and the number of miles it gets to the gallon. Do you predict a negative relationship (the heavier the car, the lower the miles per gallon)?

> cor(mtcars$wt, mtcars$mpg)
[1] -0.8676594

Figure 3.7: Scatterplot of the relationship between the weight of a car and its miles per gallon

That is a strong negative relationship. Although, in the preceding figure, note that the data points are more diffuse and spread around the regression line than in the other plots; this indicates a somewhat weaker relationship than we have seen thus far.
For an even weaker relationship, check out the correlation between wind speed and temperature in the airquality dataset as depicted in the following image:

> cor(airquality$Temp, airquality$Wind)
[1] -0.4579879
> cor(airquality$Temp, airquality$Wind, method="spearman")
[1] -0.4465408

Figure 3.8: Scatterplot of the relationship between wind speed and temperature
Comparing multiple correlations
Armed with our new standardized coefficients, we can now effectively compare the correlations between different pairs of variables directly.
In data analysis, it is common to compare the correlations between all the numeric variables in a single dataset. We can do this with the iris dataset using the following R code snippet:

> # have to drop 5th column (species is not numeric)
> iris.nospecies <- iris[, -5]
> cor(iris.nospecies)
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

This produces a correlation matrix (when it is done with the covariance, it is called a covariance matrix). It is square (the same number of rows and columns) and symmetric, which means that the matrix is identical to its transposition (the matrix with the axes flipped). It is symmetric because there are two elements for each pair of variables, one on either side of the diagonal line of 1s. The diagonal is all 1s, because every variable is perfectly correlated with itself. Which are the most highly (positively) correlated pairs of variables? What about the most negatively correlated?
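One way to answer those questions programmatically (a sketch, not the only way) is to blank out the trivial diagonal and ask for the locations of the extreme coefficients:

```r
# Locate the most positively and most negatively correlated
# pairs in the iris correlation matrix
cor.matrix <- cor(iris[, -5])
diag(cor.matrix) <- NA   # ignore each variable's correlation with itself

highest <- which(cor.matrix == max(cor.matrix, na.rm=TRUE), arr.ind=TRUE)
lowest  <- which(cor.matrix == min(cor.matrix, na.rm=TRUE), arr.ind=TRUE)

rownames(cor.matrix)[highest[1, ]]   # Petal.Width and Petal.Length
rownames(cor.matrix)[lowest[1, ]]    # Petal.Length and Sepal.Width
```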
Visualization methods
We are now going to see how we can create these kinds of visualizations on our own.
Categorical and continuous variables
We have seen that box plots are a great way of comparing the distribution of a continuous variable across different categories. As you might expect, box plots are very easy to produce using ggplot2. The following snippet produces the box-and-whisker plot that we saw earlier, depicting the relationship between the petal lengths of the different iris species in the iris dataset:

> library(ggplot2)
> qplot(Species, Petal.Length, data=iris, geom="boxplot",
+       fill=Species)

First, we specify the variable on the x-axis (the iris species) and then the continuous variable on the y-axis (the petal length). Finally, we specify that we are using the iris dataset, that we want a box plot, and that we want to fill the boxes with different colors for each iris species.
Another fun way of comparing distributions between the different categories is by using an overlapping density plot:

> qplot(Petal.Length, data=iris, geom="density", alpha=I(.7),
+       fill=Species)

Here we need only specify the continuous variable, since the fill parameter will break down the density plot by species. The alpha parameter adds transparency to show more clearly the extent to which the distributions overlap.
Figure 3.9: Overlapping density plot of petal length of iris flowers across species
If it is not the distribution you are trying to compare but some kind of single-value statistic (like standard deviation or sample counts), you can use the by function to get that value across all categories, and then build a bar plot where each category is a bar and the heights of the bars represent that category's statistic. For the code to construct a bar plot, refer back to the last section in Chapter 1, RefresheR.
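As a sketch of that approach (using base R's barplot rather than ggplot2), the following computes the standard deviation of petal length for each species and plots the three values as bars:

```r
# Single-value statistic per category: standard deviation of
# petal length, computed with by() and drawn as a bar plot
sds <- by(iris$Petal.Length, iris$Species, sd)
barplot(unlist(sds), names.arg=names(sds),
        ylab="standard deviation of petal length")
```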
Two categorical variables
The visualization of categorical data is a grossly understudied domain and, in spite of some fairly powerful and compelling visualization methods, these techniques remain relatively unpopular.
My favorite method for graphically illustrating contingency tables is to use a mosaic plot. To make mosaic plots, we will need to install and load the vcd (Visualizing Categorical Data) package:

> # install.packages("vcd")
> library(vcd)
>
> ucba <- data.frame(UCBAdmissions)
> mosaic(Freq ~ Gender + Admit, data=ucba,
+        shade=TRUE, legend=FALSE)

Figure 3.10: A mosaic plot of the UCBAdmissions dataset (across all departments)

The first argument to the mosaic function is a formula. This formula is meant to be read as: display frequency broken down by gender and whether the applicant was admitted. shade=TRUE adds a little life to the plot by adding colors to the boxes. The colors are actually very meaningful, as is the legend we opted not to show with the final parameter, but its meaning is beyond the scope of this section.
The mosaic plot represents each cell of a 2x2 contingency table as a tile; the area of the box is proportional to the number of observations in that cell. From this plot, we can easily tell that (a) more men applied to UCB than women, (b) more applicants were rejected than accepted, and (c) women were rejected at a higher proportion than male applicants.
You remember how this was misleading, right? Let's look at the mosaic plot for only department A:

> mosaic(Freq ~ Gender + Admit, data=ucba[ucba$Dept=="A",],
+        shade=TRUE, legend=FALSE)

Figure 3.11: A mosaic plot of the UCBAdmissions dataset for department A

Hopefully, this plot makes the treachery of Simpson's paradox more apparent. Notice how there were far fewer female applicants than males, but the admission rates for the female applicants were much higher. Try visualizing the mosaic plots for the other departments by yourself!
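The proportions behind these plots can also be checked numerically; this sketch uses xtabs and prop.table on the same data:

```r
# Admission rates by gender, overall and for department A only
ucba <- data.frame(UCBAdmissions)

overall <- xtabs(Freq ~ Gender + Admit, data=ucba)
prop.table(overall, margin=1)   # ~45% of men vs ~30% of women admitted

dept.a <- xtabs(Freq ~ Gender + Admit, data=ucba[ucba$Dept=="A", ])
prop.table(dept.a, margin=1)    # ~62% of men vs ~82% of women admitted
```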
Two continuous variables
The canonical way of displaying relationships between two continuous variables is via scatterplots. The scatterplot for the women's heights and weights that we saw earlier in this chapter was produced with the following R code snippet:

> qplot(height, weight, data=women, geom="point")

Whether you put height or weight first depends on which variable you want tied to the x-axis.
What about that fancy regression line?!, you ask frantically. ggplot2 gracefully provides this feature with just a few extra characters. The scatterplot of the relationship between the weight of a car and its miles per gallon was produced as follows:

> qplot(wt, mpg, data=mtcars, geom=c("point", "smooth"),
+       method="lm", se=FALSE)

Here, we are specifying that we want two kinds of geometric objects, point and smooth. The latter is responsible for the regression line. method="lm" tells qplot that we want to use a linear model to create the trend line.
If we leave out the method, ggplot2 will choose a method automatically; in this case, it would default to a method of drawing a non-linear trend line called LOESS:

> qplot(wt, mpg, data=mtcars, geom=c("point", "smooth"), se=FALSE)

Figure 3.12: A scatterplot of the relationship between the weight of a car and its miles per gallon, and a trend line smoothed with LOESS

The se=FALSE directive instructs ggplot2 not to plot the estimates of the error. We will get to what this means in a later chapter.
More than two continuous variables
Finally, there is an excellent way to visualize correlation matrices like the one we saw with the iris dataset in the section Comparing multiple correlations. To do this, we have to install and load the corrgram package as follows:

> # install.packages("corrgram")
> library(corrgram)
>
> corrgram(iris, lower.panel=panel.conf, upper.panel=panel.pts)

Figure 3.13: A corrgram of the iris dataset's continuous variables

With corrgrams, we can exploit the fact that correlation matrices are symmetric by packing in more information. On the lower left panel, we have the Pearson correlation coefficients (never mind the small ranges beneath each coefficient for now). Instead of repeating these coefficients for the upper right panel, we can show a small scatterplot there instead.
We aren't limited to showing the coefficients and scatterplots in our corrgram, though; there are many other options and configurations available:

> corrgram(iris, lower.panel=panel.pie, upper.panel=panel.pts,
+          diag.panel=panel.density,
+          main=paste0("corrgram of petal and sepal ",
+                      "measurements in iris dataset"))

Figure 3.14: Another corrgram of the iris dataset's continuous variables

Notice that this time, we can overlay a density plot wherever there is a variable name (on the diagonal), just to get a sense of the variables' shapes. More saliently, instead of text coefficients, we have pie charts in the lower-left panel. These pie charts are meant to graphically depict the strength of the correlations.
If the color of the pie is blue (or any shade thereof), the correlation is positive; the bigger the shaded area of the pie, the stronger the magnitude of the correlation. If, however, the color of the pie is red or a shade of red, the correlation is negative, and the amount of shading on the pie is proportional to the magnitude of the correlation.
To top it all off, we added the main parameter to set the title of the plot. Note the use of paste0 so that I could split the title up into two lines of code.
To get a better sense of what corrgram is capable of, you can view a live demonstration of examples if you execute the following at the prompt:

> example(corrgram)
Exercises
Try out the following exercises to revise the concepts learned so far:
Look at the documentation on cor with help("cor"). You can see that, in addition to "pearson" and "spearman", there is an option for "kendall". Learn about Kendall's tau. Why, and under what conditions, is it considered better than Spearman's rho?
For each species of iris, find the correlation coefficient between the sepal length and width. Are there any differences? How did we just combine two different types of the broad categories of bivariate analyses to perform a complex multivariate analysis?
Download a dataset from the web, or find another built-into-R dataset that suits your fancy (using library(help="datasets")). Explore relationships between the variables that you think might have some connection.
Gustave Flaubert is well understood to be a classist misogynist and this, of course, influenced how he developed the character of Emma Bovary. However, it is not uncommon for readers to identify and empathize with her, and they are often devastated by the book's conclusion. In fact, translator Geoffrey Wall asserts that Emma dies in a pain that is exactly adjusted to the intensity of our preceding identification.
How can the fact that some sympathize with Emma be reconciled with Flaubert's apparent intention? In your response, assume a post-structuralist approach to authorial intent.
Summary
There were many new ideas introduced in this chapter, so kudos to you for making it through! You're well on the way to being able to tackle some extraordinarily interesting problems on your own!
To summarize, in this chapter, we learned that the relationships between two variables can be broken down into three broad categories.
For categorical/continuous variables, we learned how to use the by function to retrieve the statistics on the continuous variable for each category. We also saw how we can use box-and-whisker plots to visually inspect the distributions of the continuous variable across categories.
For categorical/categorical configurations, we used contingency and proportions tables to compare frequencies. We also saw how mosaic plots can help spot interesting aspects of the data that might be difficult to detect when just looking at the raw numbers.
For continuous/continuous data, we discovered the concepts of covariance and correlation, and explored different correlation coefficients with different assumptions about the nature of the bivariate relationship. We also learned how these concepts could be expanded to describe the relationship between more than two continuous variables. Finally, we learned how to use scatterplots and corrgrams to visually depict these relationships.
With this chapter, we've concluded the unit on exploratory data analysis, and we'll be moving on to confirmatory data analysis and inferential statistics.
Chapter 4. Probability
It's time for us to put descriptive statistics down for the time being. It was fun for a while, but we're no longer content just determining the properties of observed data; now we want to start making deductions about data we haven't observed. This leads us to the realm of inferential statistics.
In data analysis, probability is used to quantify the uncertainty of our deductions about unobserved data. In the land of inferential statistics, probability reigns queen. Many regard her as a harsh mistress, but that's just a rumor.
Basic probability
Probability measures the likeliness that a particular event will occur. When mathematicians (us, for now!) speak of an event, we are referring to a set of potential outcomes of an experiment, or trial, to which we can assign a probability of occurrence.
Probabilities are expressed as a number between 0 and 1 (or as a percentage out of 100). An event with a probability of 0 denotes an impossible outcome, and a probability of 1 describes an event that is certain to occur.
The canonical example of probability at work is a coin flip. In the coin flip event, there are two outcomes: the coin lands on heads, or the coin lands on tails. Pretending that coins never land on their edge (they almost never do), those two outcomes are the only ones possible. The sample space (the set of all possible outcomes), therefore, is {heads, tails}. Since the entire sample space is covered by these two outcomes, they are said to be collectively exhaustive.
The sum of the probabilities of collectively exhaustive events is always 1. In this example, the probability that the coin flip will yield heads or yield tails is 1; it is certain that the coin will land on one of those. In a fair and correctly balanced coin, each of those two outcomes is equally likely. Therefore, we split the probability equally among the outcomes: in the event of a coin flip, the probability of obtaining heads is 0.5, and the probability of tails is 0.5 as well. This is usually denoted as follows:

  P(heads) = 0.5

The probability of a coin flip yielding either heads or tails looks like this:

  P(heads ∪ tails) = 1

And the probability of a coin flip yielding both heads and tails is denoted as follows:

  P(heads ∩ tails) = 0

The two outcomes, in addition to being collectively exhaustive, are also mutually exclusive. This means that they can never co-occur. This is why the probability of heads and tails is 0; it just can't happen.
The next obligatory application of beginner probability theory is the case of rolling a standard six-sided die. In the event of a die roll, the sample space is {1, 2, 3, 4, 5, 6}. With every roll of the die, we are sampling from this space. In this event, too, each outcome is equally likely, except now we have to divide the probability across six outcomes. In the following equation, we denote the probability of rolling a 1 as P(1):

  P(1) = 1/6

Rolling a 1 or rolling a 2 is not collectively exhaustive (we can still roll a 3, 4, 5, or 6), but they are mutually exclusive; we can't roll both a 1 and a 2. If we want to calculate the probability of either one of two mutually exclusive events occurring, we add the probabilities:

  P(1 ∪ 2) = P(1) + P(2) = 1/6 + 1/6 = 1/3

While rolling a 1 or rolling a 2 aren't collectively exhaustive, rolling a 1 and not rolling a 1 are. This is usually denoted in this manner:

  P(1 ∪ ¬1) = 1

These two events, and all events that are both collectively exhaustive and mutually exclusive, are called complementary events.
Our last pedagogical example in basic probability theory uses a deck of cards. Our deck has 52 cards: 4 for each number from 2 to 10 and 4 each of Jack, Queen, King, and Ace (no Jokers!). Each of these 4 cards belongs to one suit, either a Heart, Club, Spade, or Diamond. There are, therefore, 13 cards in each suit. Further, every Heart and Diamond card is colored red, and every Spade and Club is black. From this, we can deduce the following probabilities for the outcome of randomly choosing a card:

  P(Heart) = 13/52 = 0.25
  P(red card) = 26/52 = 0.5
  P(Ace) = 4/52 ≈ 0.077

What, then, is the probability of getting a black card and an Ace? Well, these events are independent, meaning that the probability of either outcome does not affect the probability of the other. In cases like these, the probability of event A and event B is the product of the probability of A and the probability of B. Therefore:

  P(black ∩ Ace) = P(black) * P(Ace) = 0.5 * 4/52 = 2/52 ≈ 0.038

Intuitively, this makes sense, because there are two black Aces out of a possible 52.
What about the probability that we choose a red card and a Heart? These two outcomes are not independent, because knowing that the card is red has a bearing on the likelihood that the card is also a Heart. In cases like these, the probability of event A and B is denoted as follows:

  P(A ∩ B) = P(B) * P(A|B)

where P(A|B) means the probability of A given B. For example, if we represent A as drawing a Heart and B as drawing a red card, P(A|B) means what's the probability of drawing a Heart if we know that the card we drew was red?. Since a red card is equally likely to be a Heart or a Diamond, P(A|B) is 0.5. Therefore:

  P(red ∩ Heart) = P(red) * P(Heart|red) = 0.5 * 0.5 = 0.25

In the preceding equation, we used the form P(B) P(A|B). Had we used the form P(A) P(B|A), we would have got the same answer:

  P(Heart) * P(red|Heart) = 0.25 * 1 = 0.25

So, these two forms are equivalent:

  P(A) * P(B|A) = P(B) * P(A|B)

For kicks, let's divide both sides of the equation by P(B). That yields the following equivalence:

  P(A|B) = P(A) * P(B|A) / P(B)
This equation is known as Bayes' Theorem. This equation is very easy to derive, but its meaning and influence is profound. In fact, it is one of the most famous equations in all of mathematics.
Bayes' Theorem has been applied to, and proven useful in, an enormous number of different disciplines and contexts. It was used to help crack the German Enigma code during World War II, saving the lives of millions. It was also used recently, and famously, by Nate Silver to help correctly predict the voting patterns of 49 states in the 2008 US presidential election.
At its core, Bayes' Theorem tells us how to update the probability of a hypothesis in light of new evidence. Due to this, the following formulation of Bayes' Theorem is often more intuitive:

  P(H|E) = P(H) * P(E|H) / P(E)

where H is the hypothesis and E is the evidence.
Let’sseeanexampleofBayes’Theoreminaction!
There’sahotnewrecreationaldrugonthescenecalledAllighate(orAllyforshort).It’s
namedassuchbecauseitmakesitsusersgowildandactlikeanalligator.Sincetheeffect
ofthedrugissodeleterious,veryfewpeopleactuallytakethedrug.Infact,onlyabout1
ineverythousandpeople(0.1%)takeit.
Frightenedbyfear-mongeringlate-nightnews,DaisyGirl,Inc.,atechnologyconsulting
firm,orderedanAllighatetestingkitforallofits200employeessothatitcouldoffer
treatmenttoanyemployeewhohasbeenusingit.Notsparinganyexpense,theybought
thebestkitonthemarket;ithad99%sensitivityand99%specificity.Thismeansthatit
correctlyidentifieddrugusers99outof100times,andonlyfalselyidentifiedanon-user
asauseronceinevery100times.
Whentheresultsfinallycameback,twoemployeestestedpositive.Thoughthetwodenied
usingthedrug,theirsupervisor,Ronald,wasreadytosendthemofftogethelp.Justas
Ronaldwasabouttosendthemoff,Shanice,acleveremployeefromthestatistics
department,cametotheirdefense.
Ronaldincorrectlyassumedthateachoftheemployeeswhotestedpositivewereusingthe
drugwith99%certaintyand,therefore,thechancesthatbothwereusingitwas98%.
Shaniceexplainedthatitwasactuallyfarmorelikelythatneitheremployeewasusing
Allighate.
How so? Let's find out by applying Bayes' theorem!
Let's focus on just one employee right now; let H be the hypothesis that one of the employees is using Ally, and E represent the evidence that the employee tested positive:

  P(Ally user | Positive test) = P(Ally user) * P(Positive test | Ally user) / P(Positive test)

We want to solve the left side of the equation, so let's plug in values. The first part of the right side of the equation, P(Positive Test | Ally User), is called the likelihood. The probability of testing positive if you use the drug is 99%; this is what tripped up Ronald, and most other people when they first hear of the problem. The second part, P(Ally User), is called the prior. This is our belief that any one person has used the drug before we receive any evidence. Since we know that only .1% of people use Ally, this would be a reasonable choice for a prior. Finally, the denominator of the equation is a normalizing constant, which ensures that the final probabilities of all possible hypotheses will add up to one. Finally, the value we are trying to solve, P(Ally user | Positive Test), is the posterior. It is the probability of our hypothesis updated to reflect new evidence.
In many practical settings, computing the normalizing factor is very difficult. In this case, because there are only two possible hypotheses, being a user or not, the probability of finding the evidence of a positive test is given as follows:

  P(Positive test) = P(Positive test | user) * P(user) + P(Positive test | non-user) * P(non-user)

which is: (.99 * .001) + (.01 * .999) = 0.01098
Plugging that into the denominator, our final answer is calculated as follows:

  P(Ally user | Positive test) = (.001 * .99) / 0.01098 ≈ 0.09

Note that the new evidence, which favored the hypothesis that the employee was using Ally, shifted our prior belief from .001 to .09. Even so, our prior belief about whether an employee was using Ally was so extraordinarily low, it would take some very, very strong evidence indeed to convince us that an employee was an Ally user.
Ignoring the prior probability in cases like these is known as the base rate fallacy. Shanice assuaged Ronald's embarrassment by assuring him that it was a very common mistake.
Now to extend this to two employees: the probability of any two employees both using the drug is, as we now know, .001 squared, or one in a million. Squaring our new posterior, we get .0081. The probability that both employees use Ally, even given their positive results, is less than 1%. So, they are exonerated.
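Shanice's whole argument fits in a few lines of R; the numbers below are exactly the ones from the story:

```r
prior       <- 0.001   # P(Ally user): 1 in 1,000
sensitivity <- 0.99    # P(positive test | Ally user)
specificity <- 0.99    # P(negative test | non-user)

# normalizing constant: P(positive test)
p.positive <- sensitivity * prior + (1 - specificity) * (1 - prior)
p.positive    # 0.01098

# the posterior, by Bayes' Theorem: P(Ally user | positive test)
posterior <- sensitivity * prior / p.positive
posterior     # about 0.09

posterior^2   # about 0.0081: both employees being users is under 1%
```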
Sally is a different story, though. Her friends noticed her behavior had dramatically changed as of late: she snaps at co-workers and has taken to eating pencils. Her concerned cubicle-mate even followed her after work and saw her crawl into a sewer, not to emerge until the next day to go back to work.
Even though Sally passed the drug test, we know that it's likely (almost certain) that she uses Ally. Bayes' theorem gives us a way to quantify that probability! Our prior is the same, but now our likelihood is pretty much as close to 1 as you can get; after all, how many non-Ally users do you think eat pencils and live in sewers?
A tale of two interpretations
Though it may seem strange to hear, there is actually a hot philosophical debate about what probability really is. Though there are others, the two primary camps into which virtually all mathematicians fall are the frequentist camp and the Bayesian camp.
The frequentist interpretation describes probability as the relative likelihood of observing an outcome in an experiment when you repeat the experiment multiple times. Flipping a coin is a perfect example; the probability of heads converges to 50% as the number of times it is flipped goes to infinity.
The frequentist interpretation of probability is inherently objective; there is a true probability out there in the world, which we are trying to estimate.
The Bayesian interpretation, however, views probability as our degree of belief about something. Because of this, the Bayesian interpretation is subjective; when evidence is scarce, there are sometimes wildly different degrees of belief among different people.
Described in this manner, Bayesianism may scare many people off, but it is actually quite intuitive. For example, when a meteorologist describes the probability of rain as 70%, people rarely bat an eyelash. But this number only really makes sense within a Bayesian framework, because exact meteorological conditions are not repeatable, as is required by frequentist probability.
Not simply a heady academic exercise, these two interpretations lead to different methodologies in solving problems in data analysis. Many times, both approaches lead to similar results. We will see examples of using both approaches to solve a problem later in this book.
Though practitioners may strongly align themselves with one side over the other, good statisticians know that there's a time and a place for both approaches.
Note
Though Bayesianism as a valid way of looking at probability is debated, Bayes' Theorem is a fact about probability and is undisputed and non-controversial.
Sampling from distributions
Observing the outcome of trials that involve a random variable, a variable whose value changes due to chance, can be thought of as sampling from a probability distribution: one that describes the likelihood of each member of the sample space occurring.
That sentence probably sounds much scarier than it needs to be. Take a die roll for example.
Figure 4.1: Probability distribution of outcomes of a die roll
Each roll of a die is like sampling from a discrete probability distribution for which each outcome in the sample space has a probability of 0.167, or 1/6. This is an example of a uniform distribution, because all the outcomes are uniformly as likely to occur. Further, there are a finite number of outcomes, so this is a discrete uniform distribution (there also exist continuous uniform distributions).
Flipping a coin is like sampling from a uniform distribution with only two outcomes. More specifically, the probability distribution that describes coin-flip events is called a Bernoulli distribution; it's a distribution describing only two events.
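Sampling from these distributions is easy to play with in R; in this sketch, set.seed is used only to make the simulation reproducible:

```r
set.seed(1)

# ten rolls of a fair die: sampling from a discrete uniform
# distribution with six outcomes
sample(1:6, size=10, replace=TRUE)

# ten coin flips: sampling from a Bernoulli distribution
sample(c("heads", "tails"), size=10, replace=TRUE)

# with many samples, the observed proportions approach 1/6
rolls <- sample(1:6, size=100000, replace=TRUE)
table(rolls) / length(rolls)
```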
Parameters
We use probability distributions to describe the behavior of random variables because they make it easy to compute with them and give us a lot of information about how a variable behaves. But before we perform computations with probability distributions, we have to specify the parameters of those distributions. These parameters will determine exactly what the distribution looks like and how it will behave.
For example, the behavior of both a 6-sided die and a 12-sided die is modeled with a uniform distribution. Even though the behavior of both dice is modeled as uniform distributions, the behavior of each is a little different. To further specify the behavior of each distribution, we detail its parameter; in the case of the (discrete) uniform distribution, the parameter is called n. A uniform distribution with parameter n has n equally likely outcomes of probability 1/n. The n for a 6-sided die and a 12-sided die is 6 and 12, respectively.
For a Bernoulli distribution, which describes the probability distribution of an event with only two outcomes, the parameter is p. Outcome 1 occurs with probability p, and the other outcome occurs with probability 1 - p, because they are collectively exhaustive. The flip of a fair coin is modeled as a Bernoulli distribution with p = 0.5.
Imagine a six-sided die with one side labeled 1 and the other five sides labeled 2. The outcome of the die roll trials can be described with a Bernoulli distribution, too! This time, p = 1/6 (about 0.17). Therefore, the probability of not rolling a 1 is 5/6.
The binomial distribution
The binomial distribution is a fun one. Like the uniform distribution described in the previous section, it is discrete.
When an event has two possible outcomes, success or failure, this distribution describes the number of successes in a certain number of trials. Its parameters are n, the number of trials, and p, the probability of success.
Concretely, a binomial distribution with n=1 and p=0.5 describes the behavior of a single coin flip, if we choose to view heads as successes (we could also choose to view tails as successes). A binomial distribution with n=30 and p=0.5 describes the number of heads we should expect in 30 flips.

Figure 4.2: A binomial distribution (n=30, p=0.5)

On average, of course, we would expect to have 15 heads. However, randomness is the name of the game, and seeing more or fewer heads is totally expected.
How can we use the binomial distribution in practice?, you ask. Well, let's look at an application.
Larry the Untrustworthy Knave, who can only be trusted some of the time, gives us a coin that he alleges is fair. We flip it 30 times and observe 10 heads.
It turns out that the probability of getting exactly 10 heads on 30 flips is about 2.8%*. We can use R to tell us the probability of getting 10 or fewer heads using the pbinom function:

> pbinom(10, size=30, prob=.5)
[1] 0.04936857

It appears as if the probability of this occurring, in a correctly balanced coin, is roughly 5%. Do you think we should take Larry at his word?
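Both probabilities above can be checked with R's binomial functions, and even approximated with a brute-force simulation:

```r
dbinom(10, size=30, prob=.5)   # P(exactly 10 heads) ≈ 0.028
pbinom(10, size=30, prob=.5)   # P(10 or fewer heads) ≈ 0.049

# simulate 100,000 sessions of 30 fair flips each and count how
# often we see 10 or fewer heads
set.seed(1)
flips <- rbinom(100000, size=30, prob=.5)
mean(flips <= 10)              # close to pbinom's exact answer
```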
Note
*If you're interested
The way we determined the probability of getting exactly 10 heads is by using the probability formula for Bernoulli trials. The probability of getting k successes in n trials is equal to:

  P(k successes) = (n choose k) * p^k * (1-p)^(n-k)

where p is the probability of getting one success and:

  (n choose k) = n! / (k! * (n-k)!)

If your palms are getting sweaty, don't worry. You don't have to memorize this in order to understand any later concepts in this book.
The normal distribution
Do you remember in Chapter 2, The Shape of Data, when we described the normal distribution and how ubiquitous it is? The behavior of many random variables in real life is very well described by a normal distribution with certain parameters.
The two parameters that uniquely specify a normal distribution are µ (mu) and σ (sigma). µ, the mean, describes where the distribution's peak is located, and σ, the standard deviation, describes how wide or narrow the distribution is.
Figure 4.3: Normal distributions with different parameters
The distribution of heights of American females is approximately normally distributed with parameters µ = 65 inches and σ = 3.5 inches.
Figure 4.4: Normal distributions with different parameters
With this information, we can easily answer questions about how probable it is to choose, at random, US women of certain heights.
As mentioned earlier in Chapter 2, The Shape of Data, we can't really answer the question What is the probability that we choose a person who is exactly 60 inches?, because virtually no one is exactly 60 inches. Instead, we answer questions about how probable it is that a random person is within a certain range of heights.
What is the probability that a randomly chosen woman is 70 inches or taller? If you recall, the probability of a height within a range is the area under the curve, or the integral over that range. In this case, the range we will integrate looks like this:
Figure 4.5: Area under the curve of the height distribution from 70 inches to positive infinity

> f <- function(x){ dnorm(x, mean=65, sd=3.5) }
> integrate(f, 70, Inf)
0.07656373 with absolute error < 2.2e-06

The preceding R code indicates that there is a 7.66% chance of randomly choosing a woman who is 70 inches or taller.
Luckily for us, the normal distribution is so popular and well studied that there is a function built into R, so we don't need to use integration ourselves:

> pnorm(70, mean=65, sd=3.5)
[1] 0.9234363

The pnorm function tells us the probability of choosing a woman who is shorter than 70 inches. If we want to find P(> 70 inches), we can either subtract this value from 1 (which gives us the complement) or use the optional argument lower.tail=FALSE. If you do this, you'll see that the result matches the 7.66% chance we arrived at earlier.
The three-sigma rule and using z-tables
When dealing with a normal distribution, we know that it is more likely to observe an outcome that is close to the mean than it is to observe one that is distant, but just how much more likely? Well, it turns out that roughly 68% of all the values drawn from a normal distribution lie within 1 standard deviation, or 1 z-score, away from the mean. Expanding our boundaries, we find that roughly 95% of all values are within 2 z-scores from the mean. Finally, about 99.7% of normal deviates are within 3 standard deviations from the mean. This is called the three-sigma rule.
Figure 4.6: The three-sigma rule
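The three percentages quoted above are easy to recover with pnorm on the standard normal distribution:

```r
# probability mass within k standard deviations of the mean
pnorm(1) - pnorm(-1)   # ≈ 0.683
pnorm(2) - pnorm(-2)   # ≈ 0.954
pnorm(3) - pnorm(-3)   # ≈ 0.997
```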
Before computers came on the scene, finding the probability of ranges associated with random deviates was a little more complicated. To save mathematicians from having to integrate the Gaussian (normal) function by hand (eww!), they used a z-table, or standard normal table. Though using this method today is, strictly speaking, unnecessary, and it is a little more involved, understanding how it works is important at a conceptual level. Not to mention that it gives you street cred as far as statisticians are concerned!

Formally, the z-table tells us the values of the cumulative distribution function at different z-scores of a normal distribution. Less abstractly, the z-table tells us the area under the curve from negative infinity to certain z-scores. For example, looking up -1 on a z-table will tell us the area to the left of 1 standard deviation below the mean (15.9%).

Z-tables only describe the cumulative distribution function (area under the curve) of a standard normal distribution—one with a mean of 0 and a standard deviation of 1. However, we can use a z-table on normal distributions with any parameters, µ and σ. All you need to do is convert a value from the original distribution into a z-score. This process is called standardization.

To use a z-table to find the probability of choosing a US woman at random who is taller than 70 inches, we first have to convert this value into a z-score. To do this, we subtract the mean (65 inches) from 70 and then divide that value by the standard deviation (3.5 inches). This gives us a z-score of 1.43.

Then, we find 1.43 on the z-table; on most z-table layouts, this means finding the row labeled 1.4 (the z-score up to the tenths place) and the column ".03" (the value in the hundredths place). The value at this intersection is .9236, which means that the complement (someone taller than 70 inches) is 1 - .9236 = 0.0764. This is the same answer we got when we used integration and the pnorm function.
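The whole z-table procedure can be reproduced in R; standardization is just subtraction and division:

```r
# Standardize 70 inches against a normal distribution
# with mean 65 and standard deviation 3.5
z <- (70 - 65) / 3.5    # 1.43, rounded to the hundredths place
pnorm(z)                # area to the left (what the z-table holds)
1 - pnorm(z)            # the complement: P(taller than 70 inches)
```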
Exercises

Practise the following exercises to reinforce the concepts learned in this chapter:

Recall the drug testing at Daisy Girl, Inc. earlier in the chapter. We used .1% as our prior probability that the employee was using the drug. Why should this prior have been even lower? Using a subjective Bayesian interpretation of probability, estimate what the prior should have been, given that the employee was able to hold down a job and no one saw her/him act like an alligator.

Harken back to the example of the coin from Larry the Untrustworthy Knave. We would expect the proportion of heads in a fair coin that is flipped many times to be around 50%. In Larry's coin, the proportion was 2/3, which is unlikely to occur. The probability of 20 heads in 30 flips was 2.1%. Now, find the probability of getting 40 heads in 60 flips. Even though the proportions are the same, why is observing 40 heads in 60 flips so much less probable? Understanding the answer to this question is key to understanding sampling theory and inferential data analysis.

Use the binomial distribution and pbinom to calculate the probability of observing 10 or fewer "1"s when rolling a fair 6-sided die 50 times. View rolling a "1" as a success and not rolling a "1" as a failure. What is the value of the parameter p?

Use a z-table to find the probability of choosing a US woman at random who is 60 inches or shorter. Why is this the same probability as choosing one who is 70 inches or taller?

Suppose a trolley is coming down the tracks, and its brakes are not working. It is poised to run over five people who are hanging out on the tracks ahead of it. You are next to a lever that can change the tracks that the trolley is riding on. However, the second set of tracks has one person hanging out on it, too.

Is it morally wrong to not pull the lever so that only one person is hurt, rather than five?

How would a utilitarian respond? Next, what would Thomas Aquinas say about this? Back up your thesis by appealing to the Doctrine of the Double Effect in Summa Theologica. Also, what would Kant say? Back up your response by appealing to the categorical imperative introduced in the Foundation of the Metaphysic of Morals.
Summary

In this chapter, we took a detour through probability land. You learned some basic laws of probability, about sample spaces, and conditional independence. You also learned how to derive Bayes' Theorem and learned that it provides the recipe for updating hypotheses in the light of new evidence.

We also touched upon the two primary interpretations of probability. In future chapters, we will be employing techniques from both those approaches.

We concluded with an introduction to sampling from distributions and used two—the binomial and the normal distributions—to answer interesting non-trivial questions about probability.

This chapter laid the important foundation that supports confirmatory data analysis. Making and checking inferences based on data is all about probability and, at this point, we know enough to move on to have a great time testing hypotheses with data!
Chapter 5. Using Data to Reason About the World

In Chapter 4, Probability, we mentioned that the mean height of US females is 65 inches. Now pretend we didn't know this fact—how could we find out what the average height is? We can measure every US female, but that's untenable; we would run out of money, resources, and time before we even finished with a small city!

Inferential statistics gives us the power to answer this question using a very small sample of all US women. We can use the sample to tell us something about the population we drew it from. We can use observed data to make inferences about unobserved data. By the end of this chapter, you too will be able to go out and collect a small amount of data and use it to reason about the world!
Estimating means

In the example that is going to span this entire chapter, we are going to be examining how we would estimate the mean height of all US women using only samples. Specifically, we will be estimating the population parameters using samples' means as an estimator.

I am going to use the vector all.us.women to represent the population. For simplicity's sake, let's say there are only 10,000 US women.

> # setting seed will make random number generation reproducible
> set.seed(1)
> all.us.women <- rnorm(10000, mean=65, sd=3.5)

We have just created a vector of 10,000 normally distributed random variables with the same parameters as our population of interest using the rnorm function. Of course, at this point, we can just call mean on this vector and call it a day—but that's cheating! We are going to see that we can get really, really close to the population mean without actually using the entire population.

Now, let's take a random sample of ten from this population using the sample function and compute the mean:

> our.sample <- sample(all.us.women, 10)
> mean(our.sample)
[1] 64.51365

Hey, not a bad start!
Our sample will, in all likelihood, contain some short people, some normal people, and some tall people. There's a chance that when we choose a sample, we choose one that contains predominately short people, or a disproportionate number of tall people. Because of this, our estimate will not be exactly accurate. However, as we choose more and more people to include in our sample, those chance occurrences—imbalanced proportions of the short and tall—tend to balance each other out.

Note that as we increase our sample size, the sample mean isn't always closer to the population mean, but it will be closer on average.

We can test that assertion ourselves! Study the following code carefully and try running it yourself.
> population.mean <- mean(all.us.women)
>
> for(sample.size in seq(5, 30, by=5)){
+   # create empty vector with 1000 elements
+   sample.means <- numeric(1000)
+   for(i in 1:1000){
+     sample.means[i] <- mean(sample(all.us.women, sample.size))
+   }
+   distances.from.true.mean <- abs(sample.means - population.mean)
+   mean.distance.from.true.mean <- mean(distances.from.true.mean)
+   print(mean.distance.from.true.mean)
+ }
[1] 1.245492
[1] 0.8653313
[1] 0.7386099
[1] 0.6355692
[1] 0.5458136
[1] 0.5090788

For each sample size from 5 to 30 (going up by 5), we take 1,000 different samples from the population, calculate their means, take their absolute differences from the population mean, and average those distances.
Figure 5.1: Accuracy of sample means as a function of sample size

As you can see, increasing the sample size gets us closer to the population mean. Increasing the sample size also reduces the standard deviation between the means of the samples.

Figure 5.2: The variability of sample means as a function of sample size

Knowing that, with all other things being equal, larger samples are preferable to smaller ones, let's work with a sample size of 40 for right now. We'll take our sample and estimate our population mean as follows:

> our.new.sample <- sample(all.us.women, 40)
> mean(our.new.sample)
[1] 65.19704
The sampling distribution

So, we have estimated that the true population mean is about 65.2; we know the population mean isn't exactly 65.19704—but by just how much might our estimate be off?

To answer this question, let's take repeated samples from the population again. This time, we're going to take samples of size 40 from the population 10,000 times and plot a frequency distribution of the means.

> means.of.our.samples <- numeric(10000)
> for(i in 1:10000){
+   a.sample <- sample(all.us.women, 40)
+   means.of.our.samples[i] <- mean(a.sample)
+ }

Figure 5.3: The sampling distribution of sample means
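A plot like Figure 5.3 can be drawn with base R's hist function. This sketch rebuilds the simulation so it runs on its own (the breaks argument is just a choice that makes the shape easier to see):

```r
# Rebuild the simulation and plot the frequency distribution of the means
set.seed(1)
all.us.women <- rnorm(10000, mean=65, sd=3.5)
means.of.our.samples <- numeric(10000)
for(i in 1:10000){
  means.of.our.samples[i] <- mean(sample(all.us.women, 40))
}
hist(means.of.our.samples, breaks=50,
     main="Sampling distribution of sample means",
     xlab="sample mean height (inches)")
```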
This frequency distribution is called a sampling distribution. In particular, since we used sample means as the value of interest, this is called the sampling distribution of the sample means (whew!). You can create a sampling distribution of any statistic (median, variance, and so on), but when we refer to sampling distributions throughout this chapter, we will be specifically referring to the sampling distribution of sample means.

Check it out: the sampling distribution looks like a normal distribution—and that's because it is a normal distribution.
For a large enough sample size, the sampling distribution of the sample means from any population will be approximately normal with a mean equal to the population mean, µ, and a standard deviation of:

σ / √N

where N is the sample size and σ is the population standard deviation. This is called the central limit theorem, and it is among the most important theorems in all of statistics.

Look back at the equation. Convince yourself that sample size is proportional to the narrowness of the sampling distribution by noting that the sample size is in the denominator.
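You can check the central limit theorem's claim about the standard deviation numerically: the standard deviation of many sample means should come out close to σ/√N. (The seed, the sample size of 40, and the 10,000 repetitions here are arbitrary choices for the demonstration.)

```r
# Empirical vs. theoretical spread of the sampling distribution
set.seed(2)
population <- rnorm(10000, mean=65, sd=3.5)
N <- 40
sample.means <- replicate(10000, mean(sample(population, N)))
sd(sample.means)           # empirical spread of the sample means
sd(population) / sqrt(N)   # what the central limit theorem predicts
```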
The standard deviation of the sampling distribution tells us how variable the mean of a sample of a certain size can be from sample to sample. It also tells us how much we expect certain samples' means to vary from the true population mean. The standard deviation of the sampling distribution is called the standard error, and we can use it to quantify our uncertainty about our estimate of the population mean.

If the standard error is small, an estimate from one sample is likely to be closer to the true mean (because the sampling distribution is narrow). If our standard error is big, the mean of any one particular sample is likely to be farther away from the true mean, on average.
Okay,soI’veconvincedyouthatthestandarderrorisagreatstatistictouse—buthowdo
wegetit?Upuntilnow,I’vesaidthatyoucancalculateitbyeither:
Takingmanymanysamplesfromthepopulationandtakingthestandarddeviationof
thesamplemeans
Dividingthestandarddeviationofthepopulationbythesquarerootofthesample
size
However,inpractice,thisisn’tgoodenough:wedon’twanttotakerepeatedsamplesfrom
thepopulationforthesamereasonthatwecan’tmeasuretheheightsofallUSwomen
(becauseitwouldtaketoolongandcosttoomuch).And,inthecaseofusingthe
populationstandarddeviationtogetthestandarderror—well,wedon’tknowthe
populationstandarddeviation—ifwedid,wewouldhavealreadyhadtocalculatethe
populationmean,andwewouldn’thavetobeestimatingitwithsampling!
Ideally,wewanttofindthestandarderrorusingonlyonesample.Well,itturnsoutthatfor
sufficientlylargesamples,usingthesamplestandarddeviation,s,inthestandarderror
formula(insteadofthepopulationstandarddeviation,σ)isagoodenoughapproximation.
Similarly,themeanofthesamplingdistributionisequaltothepopulationmean,butwe
canuseoursample’smeanasanestimateofthat.
Note

To reiterate, for a sample of sufficient size, we can pretend that the sampling distribution of the sample means has a mean equal to the sample's mean and a standard deviation equal to the sample's standard deviation divided by the square root of the sample size. This standard deviation of the sampling distribution is called the standard error, and it is a very important number for quantifying the uncertainty of our estimation of the population mean from the sample mean.

For a concrete example, let's use our sample of 40, our.new.sample:

> mean(our.new.sample)
[1] 65.19704
> sd(our.new.sample)
[1] 3.588447
> sd(our.new.sample) / sqrt(length(our.new.sample))
[1] 0.5673833

Our sample's mean and standard deviation are 65.2 and 3.59, respectively. The standard error of the mean is 0.567.

This means that the sampling distribution of the sample means would look something like this:

Figure 5.4: Estimated sampling distribution of sample means based on one sample
Interval estimation

Again, we care about the standard error (the standard deviation of the sampling distribution of sample means) because it expresses the degree of uncertainty we have in our estimation. Because of this, it's not uncommon for statisticians to report the standard error along with their estimate.

What's more common, though, is for statisticians to report a range of numbers to describe their estimates; this is called interval estimation. In contrast, when we were just providing the sample mean as our estimate of the population mean, we were engaging in point estimation.

One common approach to interval estimation is to use confidence intervals. A confidence interval gives us a range over which a significant proportion of the sample means would fall when samples are repeatedly drawn from a population and their means are calculated. Concretely, a 95% confidence interval is the range that would contain 95% of the sample means if multiple samples were taken from the same population. 95% confidence intervals are very common, but 90% and 99% confidence intervals aren't rare.

Think about this for a second: if a 95% confidence interval contains 95% of the sample means, that means that the 95% confidence interval covers 95% of the area of the sampling distribution.

Figure 5.5: The 95% confidence interval of our estimate of the sample mean (64.085 to 66.31) covers 95% of the area in our estimated sampling distribution
Okay, so how do we find the bounds of the confidence interval? Think back to the three-sigma rule from the previous chapter on probability. Recall that about 95% of a normal distribution's area is within two standard deviations of the mean. Well, if the bounds of a confidence interval cover 95% of the sampling distribution, then the bounds must be two standard deviations away from the mean on both sides! Since the standard deviation of the distribution of interest (the sampling distribution of sample means) is the standard error, the bounds of the confidence interval are the mean minus 2 times the standard error and the mean plus 2 times the standard error.

In reality, two standard deviations (or two z-scores) away from the mean contain a little bit more than 95% of the area of the distribution. To be more precise, the range between -1.96 z-scores and 1.96 z-scores contains 95% of the area. Therefore, the bounds of a 95% confidence interval are:

x̄ ± 1.96 · s / √N

where x̄ is the sample mean and s is the sample standard deviation.

In our example, our bounds are:

> err <- sd(our.new.sample) / sqrt(length(our.new.sample))
> mean(our.new.sample) - (1.96 * err)
[1] 64.08497
> mean(our.new.sample) + (1.96 * err)
[1] 66.30912
How did we get 1.96?

You can get this number yourself by using the qnorm function.

The qnorm function is a little like the opposite of the pnorm function that we saw in the previous chapter. That function started with a p because it gave us a probability—the probability that we would see a value equal to or below it in a normal distribution. The q in qnorm stands for quantile. A quantile, for a given probability, is the value below which that proportion of the distribution's values fall.

I know that was confusing! Stated differently, but equivalently, a quantile for a given probability is the value such that if we put it in the pnorm function, we get back that same probability.

> qnorm(.025)
[1] -1.959964
> pnorm(-1.959964)
[1] 0.025

We showed earlier that 95% of the area under the curve of a normal distribution is within 1.96 z-scores of the mean. We put .025 in the qnorm function because, if the mean is right smack in the middle of the 95% confidence interval, then there is 2.5% of the area to the left of the bound and 2.5% of the area to the right of the bound. Together, this lower 2.5% and upper 2.5% make up the missing 5% of the area.

Don't feel limited to the 95% confidence interval, though. You can figure out the bounds of a 90% confidence interval using just the same procedure. For an interval that contains 90% of the area of a curve, the bounds are the values for which 5% of the area is to the left and 5% of the area is to the right (because 5% and 5% make up the missing 10%).

> qnorm(.05)
[1] -1.644854
> qnorm(.95)
[1] 1.644854
> # notice the symmetry?

That means that, for this example, the 90% confidence interval is 64.26 to 66.13, or 65.197 ± 0.933.
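The arithmetic for the 90% interval mirrors the 95% case, with qnorm(.95) as the multiplier. In this sketch, a fresh sample of 40 is drawn so that the snippet runs on its own, so the exact bounds will differ slightly from the ones quoted in the text:

```r
# Bounds of a 90% confidence interval for a sample of 40
set.seed(1)
all.us.women <- rnorm(10000, mean=65, sd=3.5)
our.new.sample <- sample(all.us.women, 40)
err  <- sd(our.new.sample) / sqrt(length(our.new.sample))
mult <- qnorm(.95)   # about 1.645
c(lower = mean(our.new.sample) - mult * err,
  upper = mean(our.new.sample) + mult * err)
```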
Note

A warning about confidence intervals

There are many misconceptions about confidence intervals floating about. The most pervasive is the misconception that a 95% confidence interval represents the interval such that there is a 95% chance that the population mean is in the interval. This is false. Once the bounds are created, it is no longer a question of probability; the population mean is either in there or it's not.

To convince yourself of this, take two samples from the same distribution and create 95% confidence intervals for both of them. They are different, right? Create a few more. How could it be the case that all of these intervals have the same probability of including the population mean?

Using a Bayesian interpretation of probability, it is possible to say that there exist intervals for which we are 95% certain that they encompass the population mean, since Bayesian probability is a measure of our certainty, or degree of belief, in something. This Bayesian counterpart to the confidence interval is called the credible interval, and we will learn about it in Chapter 7, Bayesian Methods. The procedure for its construction is very different to that of the confidence interval.
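What the 95% does mean is a statement about the procedure: if you build intervals this way over and over, about 95% of them will capture the population mean. A simulation makes this concrete (the seed, sample size of 40, and 10,000 repetitions are arbitrary choices):

```r
# How often does a 95% CI from a sample of 40 contain the population mean?
set.seed(3)
population <- rnorm(10000, mean=65, sd=3.5)
true.mean <- mean(population)
covered <- replicate(10000, {
  s <- sample(population, 40)
  err <- sd(s) / sqrt(40)
  (mean(s) - 1.96*err) <= true.mean && true.mean <= (mean(s) + 1.96*err)
})
mean(covered)   # a little under 0.95 (using 1.96 with n=40 is approximate)
```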
Smaller samples

Remember when I said that the sampling distribution of sample means is approximately normal for a large enough sample size? This caveat means that for smaller sample sizes (usually considered to be below 30), the sampling distribution of the sample means is not well approximated by a normal distribution. It is, however, well approximated by another distribution: the t-distribution.

Note

A bit of history…

The t-distribution is also known as the Student's t-distribution. It gets its name from the 1908 paper that introduced it, by William Sealy Gosset writing under the pen name Student. Gosset worked as a statistician at the Guinness Brewery and used the t-distribution and the related t-test to study small samples of the quality of the beer's raw constituents. He is thought to have used a pen name at the request of Guinness so that competitors wouldn't know that they were using the t statistic to their advantage.

The t-distribution has two parameters: the mean and the degrees of freedom (or df). For our purposes here, the degrees of freedom is equal to our sample size minus 1. For example, if we have a sample of 10 from some population and the mean is 5, then a t-distribution with parameters mean=5 and df=9 describes the sampling distribution of sample means with that sample size.

The t-distribution looks a lot like the normal distribution at first glance. However, further examination will reveal that the curve is more flat and wide. This wideness accounts for the higher level of uncertainty we have in regard to a smaller sample.

Figure 5.6: The normal distribution, and two t-distributions with different degrees of freedom

Notice that as the sample size (degrees of freedom) increases, the distribution gets narrower. As the sample size gets higher and higher, it gets closer and closer to a normal distribution. By 29 degrees of freedom, it is very close to a normal distribution indeed. This is why 30 is considered a good rule-of-thumb cut-off between large sample sizes and small sample sizes and, thus, for deciding whether to use a normal distribution or a t-distribution as a model for the sampling distribution.
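You can watch this convergence with qt: the .975 quantile of the t-distribution shrinks toward the normal's 1.96 as the degrees of freedom grow (the particular df values here are just illustrative):

```r
# t multipliers for a 95% interval at several degrees of freedom
qt(.975, df=c(4, 9, 14, 29, 99))
# compare with the normal's multiplier
qnorm(.975)
```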
Let’ssaythatwecouldonlyaffordtakingtheheightsof15USwomen.What,then,isour
95%intervalestimation?
>small.sample<-sample(all.us.women,15)
>mean(small.sample)
[1]65.51277
>qt(.025,df=14)
[1]-2.144787
>#noticethedifference
>qnorm(.025)
[1]-1.959964
Insteadofusingtheqnormfunctiontogetthecorrectmultipliertothestandarderror,we
wanttofindthequantileofthet-distributionat.025(and.975).Forthis,weusetheqt
function,whichtakesaprobabilityandnumberofdegreesoffreedom.Notethatthe
quantileofthet-distributionislargerthanthequantileofthenormaldistribution,which
willtranslatetolargerconfidenceintervalbounds;again,thisreflectstheadditional
uncertaintywehaveinourestimateduetoasmallersamplesize.
>err<-sd(small.sample)/sqrt(length(small.sample))
>mean(small.sample)-(2.145*err)
[1]64.09551
>mean(small.sample)+(2.145*err)
[1]66.93003
Inthiscase,theboundsofour95%confidenceintervalare64.1and66.9.
Exercises

Practise the following exercises to revise the concepts learned in this chapter:

Write a function that takes a vector and returns the 95% confidence interval for that vector. You can return the interval as a vector of length two: the lower bound and the upper bound. Then, parameterize the confidence coefficient by letting the user of your function choose their own confidence level, but keep 95% as the default. Hint: the first line will look like this:

conf.int <- function(data.vector, conf.coeff=.95){

Back when we introduced the central limit theorem, I said that the sampling distribution from any distribution would be approximately normal. Don't take my word for it! Create a population that is uniformly distributed using the runif function and plot a histogram of the sampling distribution using the code from this chapter and the histogram-plotting code from Chapter 2, The Shape of Data. Repeat the process using the beta distribution with parameters (a=0.5, b=0.5). What does the underlying distribution look like? What does the sampling distribution look like?

A formal and rigorous definition of knowledge and what constitutes knowledge is still an open problem in epistemology. Since Plato and his dialogues, a popular definition of knowledge has been the Justified True Belief (JTB) account. On this account, an agent can be said to know something, p, if (a) p is true, (b) the agent believes that p is true, and (c) the agent is justified in believing that p is true. In a 1963 paper, Edmund Gettier introduced examples that seem to satisfy these conditions but appear not to be true cases of knowledge. Read Gettier's paper. Can the JTB account of knowledge be modified to account for Gettier problems? Or should we reject the JTB account of knowledge and start from scratch?
Summary

The central idea of this chapter is that making the leap from sample to population carries a certain amount of uncertainty with it. In order to be good, honest analysts, we need to be able to express and quantify this uncertainty.

The example we chose to illustrate this principle was estimating a population mean from a sample's mean. You learned that the uncertainty associated with inferring the population mean from sample means is modeled by the sampling distribution of the sample means. The central limit theorem tells us the parameters we can expect of this sampling distribution. You learned that we could use these parameters on their own, or in the construction of confidence intervals, to express our level of uncertainty about our estimate.

I want to congratulate you for getting this far. The topics introduced in this chapter are very often considered the most difficult to grasp in all of introductory data analysis. Your tenacity will be greatly rewarded, though; we have laid enough of a foundation to be able to get into some real, practical topics. I promise the next chapter is a lot of fun, and it is filled with interesting examples that you can start applying to real-life problems right away!
Chapter 6. Testing Hypotheses

The salt and pepper of inferential statistics are estimation and hypothesis testing. In the last chapter, we talked about estimation and making certain inferences about the world. In this chapter, we will be talking about how to test hypotheses about how the world works and how to evaluate them using only sample data.

In the last chapter, I promised that this would be a very practical chapter, and I'm a man of my word; this chapter goes over a broad range of the most popular methods in modern data analysis at a relatively high level. Even so, this chapter might have a little more detail than the lazy and impatient would want. At the same time, it will have way too little detail for the extremely curious and mathematically inclined. In fact, some statisticians would have a heart attack at the degree to which I skip over the math involved with these subjects—but I won't tell if you don't!

Nevertheless, certain complicated concepts and math are beyond the scope of this book. The good news is that once you, dear reader, have the general concepts down, it is easy to deepen your knowledge of these techniques and their intricacies—and I advocate that you do so before making any major decisions based on the tests introduced in these chapters.
Null Hypothesis Significance Testing

For better or worse, Null Hypothesis Significance Testing (NHST) is the most popular hypothesis testing framework in modern use. So, even though there are competing approaches that—at least in some cases—are better, you need to know this stuff up and down!

Okay—Null Hypothesis Significance Testing—those are a bunch of big words. What do they mean?

NHST is a lot like being a prosecutor in the United States' or Great Britain's justice system. In these two countries—and a few others—the person being charged is presumed innocent, and the burden of proving the defendant's guilt is placed on the prosecutor. The prosecutor then has to argue that the evidence is inconsistent with the defendant being innocent. Only after it is shown that the extant evidence is unlikely if the person is innocent does the court rule a guilty verdict. If the extant evidence is weak, or is likely to be observed even if the defendant is innocent, then the court rules not guilty. That doesn't mean the defendant is innocent (the defendant may very well be guilty!)—it means that either the defendant was innocent, or the defendant was guilty but there was not sufficient evidence to prove it.

With simple NHST, we are testing two competing hypotheses: the null and the alternative hypotheses. The default hypothesis is called the null hypothesis—it is the hypothesis that our observation occurred from chance alone. In the justice system analogy, this is the hypothesis that the defendant is innocent. The alternative hypothesis is the opposite (or complementary) hypothesis; this would be like the prosecutor's hypothesis.
The null hypothesis terminology was introduced by a statistician named R.A. Fisher in regard to the curious case of Muriel Bristol: a woman who claimed that she could discern, just by tasting it, whether milk was added before tea in a teacup or whether the tea was poured before the milk. She is more commonly known as the lady tasting tea.

Her claim was put to the test! The lady tasting tea was given eight cups; four had milk added first, and four had tea added first. Her task was to correctly identify the four cups that had tea added first. The null hypothesis was that she couldn't tell the difference and would choose a random four teacups. The alternative hypothesis was, of course, that she had the ability to discern whether the tea or milk was poured first.

It turned out that she correctly identified the four cups. The chance of randomly choosing the correct four cups is 1 in 70, or about 1.4%. In other words, the chance of that happening under the null hypothesis is 1.4%. Given that it is so very unlikely to have occurred under the null hypothesis, we may choose to reject the null hypothesis. If the null and alternative hypotheses are mutually exclusive and collectively exhaustive, then a rejection of the null hypothesis is tantamount to an acceptance of the alternative hypothesis.

We can't say anything for certain, but we can work with probabilities. In this example, we wanted to prove or disprove the lady tasting tea's claims. We did not try to evaluate the probability that the lady could tell the difference; we assumed that she could not, and tried to show that her performance on the assessment was unlikely given that assumption.
So,here’sthebasicideabehindNHSTasweknowitsofar:
1. Assumetheoppositeofwhatyouaretesting.
2. (Tryto)showthattheresultsyoureceiveareunlikelygiventhatassumption.
3. Rejecttheassumption.
Wehaveheretoforebeenratherhand-wavyaboutwhatconstitutessufficientunlikelihood
torejectthenullhypothesisandhowwedeterminetheprobabilityinthefirstplace.We’ll
discussthisnow.
In order to quantify how likely or unlikely the results we receive are, we need to define a test statistic—some measure of the sample. The sampling distribution of the test statistic will tell us which test statistics are most likely to occur by chance (under the null hypothesis) with repeated trials of the experiment. Once we know what the sampling distribution of the test statistic looks like, we can tell what the probability of getting a result as extreme as (or more extreme than) the one we got is. This is called a p-value. If it is equal to or below some pre-specified boundary, called an alpha level (α level), we decide that the null hypothesis is a bad hypothesis and embrace the alternative hypothesis. Largely as a matter of tradition, an alpha level of .05 is used most often, though other levels are occasionally used as well. So, if the observed result would only occur 5% or less of the time (p-value < .05), we consider it a sufficiently unlikely event and reject the null hypothesis. If the .05 cut-off sounds rather arbitrary, it's because it is.
So,here’sourupdatedandexpandedbasicideabehindNHST:
1. Formulateasetoftwohypotheses:anullhypothesis(oftendenotedasH0)andan
alternativehypothesis(oftendenotedH1)
H0:thereisnoeffect
H1:thereisaneffect
2. Computetheteststatistic.
3. Giventhesamplingdistributionoftheteststatisticunderthenullhypothesis,youcan
calculatetheprobabilityofobtainingateststatisticequaltoormoreextremethanthe
oneyoucalculated.Thisisthep-value.Findit.
4. Iftheprobabilityofobtainingateststatisticbeingequaltoormoreextremethanthe
oneyoucalculatedissufficientlyunlikely(equaltoorlessthanyouralphalevel),
thenyoumayrejectthenullhypothesis.
5. Ifthenullandalternativehypothesesarecollectivelyexhaustive,youmayembrace
thealternativehypothesis.
The illustrative example that's going to make sense out of all of this is none other than the gambit of Larry the Untrustworthy Knave that we met in Chapter 4, Probability. If you recall, Larry, who can only be trusted some of the time, gave us a coin that he alleges is fair. We flip it 30 times and observe 10 heads. Let's hypothesize that the coin is unfair; let's formalize our hypotheses:

H0 (null hypothesis): the probability of obtaining heads on this coin is .5
H1 (alternative hypothesis): the probability of obtaining heads on this coin is not .5

Let's just use the number of heads in our sample as the test statistic. What is the sampling distribution of this test statistic? In other words, if the coin were fair, and you repeated the flipping-30-times experiment many times, what is the relative frequency of observing particular numbers of heads? We've seen it already! It's the binomial distribution. A binomial distribution with parameters n=30 and p=0.5 describes the number of heads we should expect in 30 flips.

Figure 6.1: The sampling distribution of our coin-flip test statistic (the number of heads)

As you can see, the outcome that is the most likely is getting 15 heads (as you might imagine). Can you see what the probability of getting 10 heads is? Fairly unlikely, right?
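You can read that probability right off the binomial probability mass function with dbinom:

```r
# Probability of exactly 10 heads in 30 flips of a fair coin
dbinom(10, size=30, prob=.5)   # about 0.028
```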
So,what’sthep-value,andisitlessthanourpre-specifiedalphalevel?Well,wehave
alreadyworkedouttheprobabilityofobserving10orfewerheadsinChapter4,
Probability,asfollows:
>pbinom(10,size=30,prob=.5)
[1]0.04936857
It’slessthan.05.Wecanconcludethecoinisunfair,right?Well,yesandno.Mostlyno.
Allowmetoexplain.
One and two-tailed tests

You may reject the null hypothesis if the test statistic falls within a region under the curve of the sampling distribution that covers 5% of the area (if the alpha level is .05). This is called the critical region. Do you remember, in the last chapter, we constructed 95% confidence intervals that covered 95% of the sampling distribution? Well, the 5% critical region is like the opposite of this. Recall that, in order to make a symmetric 95% of the area under the curve, we had to start at the .025 quantile and end at the .975 quantile, leaving 2.5% on the left tail and 2.5% on the right tail uncovered.

Similarly, in order for the critical region of a hypothesis test to cover 5% of the most extreme areas under the curve, the region must cover everything to the left of the .025 quantile and everything to the right of the .975 quantile.

So, in order to determine that the 10 heads out of 30 flips is statistically significant, the probability that you would observe 10 or fewer heads has to be less than .025.
There’safunctionbuiltrightintoR,calledbinom.test,whichwillperformthe
calculationsthatwehave,untilnow,beendoingbyhand.Inthemostbasicincantationof
binom.test,thefirstargumentisthenumberofsuccessesinaBernoullitrial(thenumber
ofheads),andthesecondargumentisthenumberoftrialsinthesample(thenumberof
coinflips).
>binom.test(10,30)
Exactbinomialtest
data:10and30
numberofsuccesses=10,numberoftrials=30,p-value=0.09874
alternativehypothesis:trueprobabilityofsuccessisnotequalto0.5
95percentconfidenceinterval:
0.17287420.5281200
sampleestimates:
probabilityofsuccess
0.3333333
Ifyoustudytheoutput,you’llseethatthep-valuedoesnotcrossthesignificance
threshold.
Now, suppose that Larry said that the coin was not biased towards tails. To see if Larry was lying, we only want to test the alternative hypothesis that the probability of heads is less than .5. In that case, we would set up our hypotheses like this:

H0: The probability of heads is greater than or equal to .5
H1: The probability of heads is less than .5

This is called a directional hypothesis, because we have a hypothesis that asserts that the deviation from chance goes in a particular direction. In this hypothesis suite, we are only testing whether the observed probability of heads falls into a critical region on only one side of the sampling distribution of the test statistic. The statistical test that we would perform in this case is, therefore, called a one-tailed test—the critical region only lies on one tail. Since the area of the critical region no longer has to be divided between the two tails (like in the two-tailed test we performed earlier), the critical region only contains the area to the left of the .05 quantile.

Figure 6.2: The three panels, from left to right, depict the critical regions of the left ("lesser") one-tailed, two-tailed, and right ("greater") alternative hypotheses. The dashed horizontal line is meant to show that, for the two-tailed tests, the critical region starts below p=.025, because it is being split between two tails. For the one-tailed tests, the critical region is below the dashed horizontal line at p=.05.

As you can see from the figure, for the directional alternative hypothesis that heads has a probability less than .5, 10 heads is now included in the green critical region.

We can use the binom.test function to test this directional hypothesis, too. All we have to do is specify the optional parameter alternative and set its value to "less" (its default is "two.sided" for a two-tailed test).
> binom.test(10, 30, alternative="less")

	Exact binomial test

data: 10 and 30
number of successes = 10, number of trials = 30, p-value = 0.04937
alternative hypothesis: true probability of success is less than 0.5
95 percent confidence interval:
 0.0000000 0.4994387
sample estimates:
probability of success
             0.3333333

If we wanted to test the directional hypothesis that the probability of heads was greater than .5, we would use alternative="greater".

Take note of the fact that the p-value is now less than .05. In fact, it is identical to the probability we got from the pbinom function.
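We can confirm that equivalence directly; the one-tailed exact test's p-value is just the lower tail of the binomial distribution:

```r
# The one-tailed p-value from binom.test matches pbinom exactly
binom.test(10, 30, alternative="less")$p.value
pbinom(10, size=30, prob=.5)
# both are 0.04936857
```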
Whenthingsgowrong
Certaintyisacardrarelyusedinthedeckofadataanalyst.Sincewemakejudgmentsand
inferencesbasedonprobabilities,mistakeshappen.Inparticular,therearetwotypesof
mistakesthatarepossibleinNHST:TypeIerrorsandTypeIIerrors.
ATypeIerroriswhenahypothesistestconcludesthatthereisaneffect(rejectsthe
nullhypothesis)when,inreality,nosucheffectexists
ATypeIIerroroccurswhenwefailtodetectarealeffectintheworldandfailto
rejectthenullhypothesisevenifitisfalse
Check the following table for the errors encountered in the coin example:

Coin type      | Fail to reject null hypothesis         | Reject the null hypothesis
               | (conclude no detectable effect)        | (conclude that there is an effect)
---------------|----------------------------------------|----------------------------------------
Coin is fair   | Correct identification (true negative) | Type I error (false positive)
Coin is unfair | Type II error (false negative)         | Correct identification (true positive)
In the criminal justice system, Type I errors are considered especially heinous. Legal theorist William Blackstone is famous for his quote: it is better that ten guilty persons escape than one innocent suffer. This is why the court instructs jurors (in the United States, at least) to only convict the defendant if the jury believes the defendant is guilty beyond a reasonable doubt. The consequence is that if the jury favors the hypothesis that the defendant is guilty, but only by a little bit, the jury must give the defendant the benefit of the doubt and acquit.
This line of reasoning holds for hypothesis testing as well. Science would be in a sorry state if we accepted alternative hypotheses on rather flimsy evidence willy-nilly; it is better that we err on the side of caution when making claims about the world, even if that means that we make fewer discoveries of honest-to-goodness, real-world phenomena because our statistical tests failed to reach significance.
This sentiment underlies the decision to use an alpha level like .05. An alpha level of .05 means that, when the null hypothesis is true, we will only commit a Type I error (false positive) 5% of the time. If the alpha level were higher, we would make fewer Type II errors, but at the cost of making more Type I errors, which are more dangerous in most circumstances.
There is a similar metric to the alpha level, and it is called the beta level (β level). The beta level is the probability that we would fail to reject the null hypothesis if the alternative hypothesis were true. In other words, it is the probability of making a Type II error.
The complement of the beta level, 1 minus the beta level, is the probability of correctly detecting a true effect if one exists. This is called power. This varies from test to test. Computing the power of a test, a technique called power analysis, is a topic beyond the scope of this book. For our purposes, it will suffice to say that it depends on the type of test being performed, the sample size being used, and on the size of the effect that is being tested (the effect size). Greater effects, like the average difference in height between women and men, are far easier to detect than small effects, like the average difference in the length of earthworms in Carlisle and in Birmingham. Statisticians like to aim for a power of at least 80% (a beta level of .2). A test that doesn't reach this level of power (because of a small sample size or small effect size, and so on) is said to be underpowered.
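Although power analysis is beyond the scope of this book, R's built-in power.t.test function gives a feel for the trade-offs just described. As a sketch (the effect size of half a standard deviation here is an arbitrary, hypothetical choice, not one from the text):

```r
# how many observations per group does a two-sample t-test need
# to detect an effect of half a standard deviation (delta=0.5,
# sd=1) with 80% power at the customary .05 alpha level?
power.t.test(delta=0.5, sd=1, sig.level=0.05, power=0.8)
```

This reports a required sample size of roughly 64 observations per group; shrinking delta (the effect size) or demanding more power drives that number up, which is why small effects need large samples.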
A warning about significance
It's perhaps regrettable that we use the term significance in relation to null hypothesis testing. When the term was first used to describe hypothesis tests, the word significance was chosen because it signified something—not because it implied importance. As I wrote this chapter, I checked the thesaurus for the word significant, and it indicated that synonyms include notable, worthy of attention, and important. This is misleading, because the colloquial sense is not equivalent to the term's intended, vestigial meaning. One thing that really confuses people is that they think statistical significance is of great importance in and of itself. This is sadly untrue; there are a few ways to achieve statistical significance without discovering anything of significance, in the colloquial sense.
Aswe’llseelaterinthechapter,onewaytoachievenon-significantstatisticalsignificance
isbyusingaverylargesamplesize.Verysmalldifferences,thatmakelittletono
differenceintherealworld,willneverthelessbeconsideredstatisticallysignificantifthere
isalargeenoughsamplesize.
For this reason, many people make the distinction between statistical significance and practical significance or clinical relevance. Many hold the view that hypothesis testing should only be used to answer the questions is there an effect? or is there a discernible difference?, and that the follow-up questions is it important? or does it make a real difference? should be addressed separately. I subscribe to this point of view.
To answer the follow-up questions, many use effect sizes, which, as we know, capture the magnitude of an effect in the real world. We will see an example of determining the effect size in a test later in this chapter.
A warning about p-values
P-values are, by far, the most talked about metric in NHST. P-values are also notorious for lending themselves to misinterpretation. Of the many criticisms of NHST (and there are many, in spite of its ubiquity), the misinterpretation of p-values ranks highly. The following are two of the most common misinterpretations:
1. A p-value is the probability that the null hypothesis is true. This is not the case. Someone misinterpreting the p-value from our first binomial test might conclude that the chances of the coin being fair are around 10%. This is false. The p-value does not tell us the probability of the hypothesis' truth or falsity. In fact, the test assumes that the null hypothesis is correct. It tells us the proportion of trials for which we would receive a result as extreme or more extreme than the one we did if the null hypothesis were correct. I'm ashamed to admit it, but I made this mistake during my first college introductory statistics class. In my final project for the class, after weeks of collecting data, I found my p-value had not passed the barrier of significance—it was something like .07. I asked my professor if, after the fact, I could change my alpha level to .1 so my results would be positive. In my request, I appealed to the fact that it was still more probable than not that my alternative hypothesis was correct—after all, if my p-value was .07, then there was a 93% chance that the alternative hypothesis was correct. He smiled and told me to read the relevant chapter of our text again. I appreciate him for his patience and restraint in not smacking me right in the head for making such a stupid mistake. Don't be like me.
2. A p-value is a measure of the size of an effect. This is also incorrect, but its wrongness is more subtle than the first misconception. In research papers, it is common to attach phrases like highly significant and very highly significant to p-values that are much smaller than .05 (like .01 and .001). It is common to interpret p-values such as these, and statements such as these, as signaling a bigger effect than p-values that are only modestly less than .05. This is a mistake; this is conflating statistical significance with practical significance. In the previous section, we explained that you can achieve significant p-values (sometimes very highly significant ones) for an effect that is, for all intents and purposes, small and unimportant. We will see a very salient example of this later in this chapter.
Testing the mean of one sample
An illustrative and fairly common statistical hypothesis test is the one sample t-test. You use it when you have one sample and you want to test whether that sample likely came from a population, by comparing the sample's mean against the known population mean. For this test to work, you have to know the population mean.
In this example, we'll be using R's built-in precip dataset that contains precipitation data from 70 US cities.
> head(precip)
     Mobile      Juneau     Phoenix Little Rock Los Angeles  Sacramento 
       67.0        54.7         7.0        48.5        14.0        17.2 
Don’tbefooledbythefactthattherearecitynamesinthere—thisisaregularoldvectorit’sjustthattheelementsarelabeled.Wecandirectlytakethemeanofthisvector,justlike
anormalone.
>is.vector(precip)
[1]TRUE
>mean(precip)
[1]34.88571
Let’spretendthatwe,somehow,knowthemeanprecipitationoftherestoftheworld—is
theUS’precipitationsignificantlydifferenttotherestoftheworld’sprecipitation?
Remember,inthelastchapter,Isaidthatthesamplingdistributionofsamplemeansfor
samplesizesunder30werebestapproximatedbyusingat-distribution.Well,thistestis
calledat-test,becauseinordertodecidewhetheroursamples’meanisconsistentwiththe
populationwhosemeanwearetestingagainst,weneedtoseewhereourmeanfallsin
relationtothesamplingdistributionofpopulationmeans.Ifthisisconfusing,rereadthe
relevantsectionfromthepreviouschapter.
In order to use the t-test in general cases—regardless of the scale—instead of working with the sampling distribution of sample means, we work with the sampling distribution of the t-statistic.
Remember z-scores from Chapter 3, Describing Relationships? The t-statistic is like a z-score in that it is a scale-less measure of distance from some mean. In the case of the t-statistic, though, we divide by the standard error instead of the standard deviation (because the standard deviation of the population is unknown). Since the t-statistic is standardized, any population, with any mean, using any scale, will have a sampling distribution of the t-statistic that is exactly the same (at the same sample size, of course).
The equation to compute the t-statistic is this:

    t = (x̄ - μ) / (s / √N)

where x̄ is the sample mean, μ is the population mean, s is the sample's standard deviation, and N is the sample size.
Let’sseeforourselveswhatthesamplingdistributionofthet-statisticlookslikebytaking
10,000samplesofsize70(thesamesizeasourprecipdataset)andplottingtheresults:
#functiontocomputet-statistic
t.statistic<-function(thesample,thepopulation){
numerator<-mean(thesample)-mean(thepopulation)
denominator<-sd(thesample)/sqrt(length(thesample))
t.stat<-numerator/denominator
return(t.stat)
}
#makethepretendpopulationnormallydistributed
#withameanof38
population.precipitation<-rnorm(100000,mean=38)
t.stats<-numeric(10000)
for(iin1:10000){
a.sample<-sample(population.precipitation,70)
t.stats[i]<-t.statistic(a.sample,population.precipitation)
}
#plot
library(ggplot2)
tmpdata<-data.frame(vals=t.stats)
qplot(vals,data=tmpdata,geom="histogram",
color=I("white"),
xlab="samplingdistributionoft-statistic",
ylab="frequency")
Figure 6.3: The sampling distribution of the t-statistic
Ah, there's that familiar shape again!
Fortunately, the sampling distribution of the t-statistic is well known, so we don't have to create our own. In fact, the sampling distributions of many test statistics are well known, so we won't be running our own simulations of them anymore. Lucky us!
Okay, so how does our sample's t-statistic compare to the t-distribution? Our t-statistic, using our function from the last code snippet, is:
> t.statistic(precip, population.precipitation)
[1] -1.901225
Though, you can work this out for yourself easily.
Figure 6.4: The t-distribution with 69 degrees of freedom. The t-statistic of our sample is shown as the dashed line
Hmm, it looks like a pretty unlikely occurrence to me, but is it statistically significant? First, let's formally define our hypotheses:
H0 = the average (mean) precipitation in the US is equal to the known average precipitation in the rest of the world
H1 = the average (mean) precipitation in the US is different than the known average precipitation in the rest of the world
Then, we prespecify an alpha level of .05, as is customary.
Since our hypothesis is non-directional (we only hypothesize that the precipitation in the US is different than the world's, not less or more), we define our critical region to cover 2.5% of the area on each side of the curve (5% in total).
> qt(.025, df=69)
[1] -1.994945
> # the critical region is less than -1.995 and more than +1.995
What does it look like now?
Figure 6.5: The previous figure with the critical region for the non-directional hypothesis highlighted
Oh, too bad! It looks like our sample mean falls just outside the critical region. So, we fail to reject the null hypothesis.
The cruel truth is that if we had, for some reason, hypothesized that the US precipitation was less than the average world precipitation:
H0 = mean US precipitation >= mean world precipitation
H1 = mean US precipitation < mean world precipitation
we would have achieved significance at alpha = .05.
Figure 6.6: Figure 6.4 with the directional critical region highlighted
Of course, we have no reason to think that US precipitation is less or more than the world's average. And to change our hypothesis now would be cheating. You're not a cheater, are you?
Now that we know what we're doing, we won't be manually calculating our test statistics anymore; we'll just be using the test functions that R provides.
Let's use the function that R provides now. The one sample t-test can be performed by the t.test function. In its most basic form, it takes a vector of sample observations as its first argument and the population mean as its second argument.
> t.test(precip, mu=38)

	One Sample t-test

data:  precip
t = -1.901, df = 69, p-value = 0.06148
alternative hypothesis: true mean is not equal to 38
95 percent confidence interval:
 31.61748 38.15395
sample estimates:
mean of x 
 34.88571 
Among other things, this test tells us that the t-statistic is -1.901 (just like we calculated ourselves), the degrees of freedom were 69 (the sample size minus 1), and the p-value, which is 0.06148. Like our plot with the two-tailed critical regions showed, this p-value is greater than our prespecified alpha level of 0.05. We fail to reject the null hypothesis.
Just for kicks, let's run the one-tailed hypothesis test:
> t.test(precip, mu=38, alternative="less")

	One Sample t-test

data:  precip
t = -1.901, df = 69, p-value = 0.03074
alternative hypothesis: true mean is less than 38
95 percent confidence interval:
     -Inf 37.61708
sample estimates:
mean of x 
 34.88571 

Now our p-value is < .05. C'est la vie.
Note
Note that the R output states the alternative hypothesis explicitly—that the true mean is less than 38—compare this with the previous t-test's output.
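If you want to see where these p-values come from, they are simply areas under the t-distribution with 69 degrees of freedom, which the pt function computes (the t-statistic below is the one we calculated earlier):

```r
# one-tailed p-value: the area to the left of our t-statistic
# under the t-distribution with 69 degrees of freedom
pt(-1.901225, df=69)

# two-tailed p-value: double that area, since the critical
# region is split between both tails
2 * pt(-1.901225, df=69)
```

The first line reproduces the one-tailed p-value (0.03074) and the second the two-tailed one (0.06148) from the t.test outputs above.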
Assumptions of the one sample t-test
There are two main assumptions of the one sample t-test:
The data are sampled from a normal distribution. This actually has more to do with the sampling distribution of sample means being approximately normal than with the actual population. As we know, the sampling distribution of sample means for sufficiently large sample sizes will always be normally distributed, even if the population is not. In reality, this assumption can be violated somewhat, and the results will be valid, especially for sample sizes of over 30. We have nothing to worry about here. Usually, people check this assumption by plotting the sample and making sure it's kind-of normal, though there are more formal ways of doing this, which we will see later. If the assumption of normality is in question, we may want to use an alternative test, like a non-parametric test; we'll see some examples at the end of this chapter.
Independence of samples: Had we tested whether the US precipitation likely came from the population of the entire world's precipitation, we would have been violating this assumption. Why? Because we know that the US is a member of that set (it is indeed 'in the world'), so of course it was drawn from that population. This is why we tested whether the US precipitation was on par with the rest of the world's precipitation. In other examples of one sample t-tests, this assumption basically requires that the sample be random.
Testing two means
An even more common hypothesis test is the independent samples t-test. You would use this to check the equality of two samples' means. Concretely, an example of using this test would be if you have an experiment where you are testing to see if a new drug lowers blood pressure. You would give one group a placebo and the other group the real medication. If the mean improvement in blood pressure was significantly greater than the improvement with the placebo, you might infer that the blood pressure medication works. Outside of more academic uses, web companies use this test all the time to test the effectiveness of, for example, different internet ad campaigns; they expose random users to one of two types of ads and test if one is more effective than the other. In web-business parlance, this is called an A-B test, but that's just business-ese for controlled experiment.
The term independent means that the two samples are separate, and that data from one sample doesn't affect data in the other. For example, if instead of having two different groups in the blood pressure trial, we used the same participants to test both conditions (randomizing the order in which we administer the placebo and the real medication), we would violate independence.
The dataset we will be using for this is the mtcars dataset that we first met in Chapter 2, The Shape of Data and saw again in Chapter 3, Describing Relationships. Specifically, we are going to test the hypothesis that the mileage is better for manual cars than it is for cars with automatic transmission. Let's compare the means and produce a boxplot:
> mean(mtcars$mpg[mtcars$am==0])
[1] 17.14737
> mean(mtcars$mpg[mtcars$am==1])
[1] 24.39231
>
> mtcars.copy <- mtcars
> # make new column with better labels
> mtcars.copy$transmission <- ifelse(mtcars$am==0,
                                     "auto", "manual")
> mtcars.copy$transmission <- factor(mtcars.copy$transmission)
> qplot(transmission, mpg, data=mtcars.copy,
+       geom="boxplot", fill=transmission) +
+   # no legend
+   guides(fill=FALSE)
Figure 6.7: Boxplot of the miles per gallon ratings for automatic cars and cars with manual transmission
Hmm, looks different… but let's check that hypothesis formally. Our hypotheses are:
H0 = mean of sample 1 - mean of sample 2 >= 0
H1 = mean of sample 1 - mean of sample 2 < 0
To do this, we use the t.test function, too; only this time, we provide two vectors: one for each sample. We also specify our directional hypothesis in the same way:
> automatic.mpgs <- mtcars$mpg[mtcars$am==0]
> manual.mpgs <- mtcars$mpg[mtcars$am==1]
> t.test(automatic.mpgs, manual.mpgs, alternative="less")

	Welch Two Sample t-test

data:  automatic.mpgs and manual.mpgs
t = -3.7671, df = 18.332, p-value = 0.0006868
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
      -Inf -3.913256
sample estimates:
mean of x mean of y 
 17.14737  24.39231 

p < .05. Yippee!
There is an easier way to use the t-test for independent samples that doesn't require us to make two vectors.
> t.test(mpg ~ am, data=mtcars, alternative="less")
This reads, roughly, perform a t-test of the mpg column grouping by the am column in the data frame mtcars. Confirm for yourself that these incantations are equivalent.
Don’tbefooled!
RememberwhenIsaidthatstatisticalsignificancewasnotsynonymouswithimportant
andthatwecanuseverylargesamplesizestoachievestatisticalsignificancewithoutany
clinicalrelevance?Checkthissnippetout:
> set.seed(16)
> t.test(rnorm(1000000, mean=10), rnorm(1000000, mean=10))

	Welch Two Sample t-test

data:  rnorm(1e+06, mean=10) and rnorm(1e+06, mean=10)
t = -2.1466, df = 1999998, p-value = 0.03183
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.0058104638 -0.0002640601
sample estimates:
mean of x mean of y 
 9.997916 10.000954 
Here, two vectors of one million normal deviates each are created with a mean of 10. When we use a t-test on these two vectors, it should indicate that the two vectors' means are not significantly different, right?
Well, we got a p-value of less than .05—why? If you look carefully at the last line of the R output, you might see why; the mean of the first vector is 9.997916, and the mean of the second vector is 10.000954. This tiny difference, a meagre .003, is enough to tip the scale into significant territory. However, I can think of very few applications of statistics where .003 of anything is noteworthy, even though it is, technically, statistically significant.
The larger point is that the t-test tests for equality of means, and if the means aren't exactly the same in the population, the t-test will, with enough power, detect this. Not all tiny differences in population means are important, though, so it is important to frame the results of a t-test and the p-value in context.
As mentioned earlier in the chapter, a salient strategy for putting the differences in context is to use an effect size. The effect size commonly used in association with the t-test is Cohen's d. Cohen's d is, conceptually, pretty simple: it is a ratio of the variance explained by the "effect" and the variance in the data itself. Concretely, Cohen's d is the difference in means divided by the sample standard deviation. A high d indicates that there is a big effect (difference in means) relative to the internal variability of the data.
I mentioned that to calculate d, you have to divide the difference in means by the sample standard deviation—but which one? Although Cohen's d is conceptually straightforward (even elegant!), it is also sometimes a pain to calculate by hand, because the sample standard deviations from both samples have to be pooled. Fortunately, there's an R package that lets us calculate Cohen's d—and other effect size metrics, to boot—quite easily. Let's use it on the auto vs. manual transmission example:
> install.packages("effsize")
> library(effsize)
> cohen.d(automatic.mpgs, manual.mpgs)

Cohen's d

d estimate: -1.477947 (large)
95 percent confidence interval:
       inf        sup 
-2.3372176 -0.6186766 
Cohen’sdis-1.478,whichisconsideredaverylargeeffectsize.Thecohen.dfunction
eventellsyouthisbyusingcannedinterpretationsofeffectsizes.Ifyoutrythiswiththe
twomillionelementvectorsfromabove,thecohen.dfunctionwillindicatethattheeffect
wasnegligible.
Althoughthesecannedinterpretationswereontargetthesetwotimes,makesureyou
evaluateyourowneffectsizesincontext.
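If you are curious about the pooling step, here is a minimal by-hand sketch using the standard pooled-standard-deviation formula (each sample's variance weighted by its degrees of freedom); it reproduces the effsize result for the transmission example:

```r
# Cohen's d by hand: difference in means divided by the
# pooled standard deviation of the two samples
automatic.mpgs <- mtcars$mpg[mtcars$am==0]
manual.mpgs <- mtcars$mpg[mtcars$am==1]

pooled.sd <- function(x, y){
  n1 <- length(x); n2 <- length(y)
  # weight each sample's variance by its degrees of freedom
  sqrt(((n1-1)*var(x) + (n2-1)*var(y)) / (n1 + n2 - 2))
}

(mean(automatic.mpgs) - mean(manual.mpgs)) /
  pooled.sd(automatic.mpgs, manual.mpgs)
# ~ -1.478, matching cohen.d
```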
Assumptions of the independent samples t-test
Homogeneity of variance (or homoscedasticity—a scary-sounding word), in this case, simply means that the variance in the miles per gallon of the automatic cars is the same as the variance in miles per gallon of the manual cars. In reality, this assumption can be violated as long as you use a Welch's t-test, like we did, instead of the Student's t-test. You can still use the Student's t-test with the t.test function, by specifying the optional parameter var.equal=TRUE. You can test for homoscedasticity formally using var.test or leveneTest from the car package. If you are sure that the assumption of homoscedasticity is not violated, you may want to do this, because the Student's t-test is a more powerful test (fewer Type II errors). Nevertheless, I usually use Welch's t-test to be on the safe side. Also, always use Welch's test if the two samples' sizes are different.
The sampling distribution of the sample means is approximately normal: Again, with a large enough sample size, it always is. We don't have a terribly large sample size here, but in reality, this formulation of the t-test works even if this assumption is violated a little. We will see alternatives in due time.
Independence: Like I mentioned earlier, since the samples contain completely different cars, we're okay on this front. For tests that, for example, use the same participants for both conditions, you would use a Dependent Samples t-test or Paired Samples t-test, which we will not discuss in this book. If you are interested in running one of these tests after some research, use t.test(<vector1>, <vector2>, paired=TRUE).
Testing more than two means
Another really common situation requires testing whether three or more means are significantly discrepant. We would find ourselves in this situation if we had three experimental conditions in the blood pressure trial: one group gets a placebo, one group gets a low dose of the real medication, and one group gets a high dose of the real medication.
Hmm, for cases like these, why don't we just do a series of t-tests? For example, we can test the directional alternative hypotheses:
The low dose of blood pressure medication lowers BP significantly more than the placebo
The high dose of blood pressure medication lowers BP significantly more than the low dose
Well, it turns out that doing this is pretty dangerous business, and the logic goes like this: if our alpha level is 0.05, then the chance of making a Type I error in one test is 0.05; if we perform two tests, then our chance of making at least one Type I error suddenly rises to 0.0975 (near 10%). By the time we perform 10 tests at that alpha level, the chance of our having made at least one Type I error is about 40%. This is called the multiple testing problem or multiple comparisons problem.
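These numbers fall out of a simple complement rule: if each independent test avoids a false positive with probability 0.95, the chance that at least one of k tests produces a false positive is one minus 0.95 raised to the power of k. A quick sketch (the function name here is just illustrative):

```r
# probability of at least one Type I error across k independent
# tests, each run at the given alpha level
family.wise.error <- function(k, alpha=0.05){
  1 - (1 - alpha)^k
}

family.wise.error(1)    # 0.05
family.wise.error(2)    # 0.0975
family.wise.error(10)   # ~0.40
```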
To circumvent this problem, in the case of testing three or more means, we use a technique called Analysis of Variance, or ANOVA. A significant result from an ANOVA leads to the inference that at least one of the means is significantly discrepant from one of the other means; it does not lend itself to the inference that all the means are significantly different. This is an example of an omnibus test, because it is a global test that doesn't tell you exactly where the differences are, just that there are differences.
You might be wondering why a test of the equality of means is called Analysis of Variance; it's because it does this by comparing the variance between groups to the variance within groups. The general intuition behind an ANOVA is that the higher the ratio of the variance between the different groups to the variance within the different groups, the less likely it is that the different groups were sampled from the same population. This ratio is called an F ratio.
For our demonstration of the simplest species of ANOVA (the one-way ANOVA), we are going to be using the WeightLoss dataset from the car package. If you don't have the car package, install it.
> library(car)
> head(WeightLoss)
    group wl1 wl2 wl3 se1 se2 se3
1 Control   4   3   3  14  13  15
2 Control   4   4   3  13  14  17
3 Control   4   3   1  17  12  16
4 Control   3   2   1  11  11  12
5 Control   5   3   2  16  15  14
6 Control   6   5   4  17  18  18
>
> table(WeightLoss$group)

Control    Diet  DietEx 
     12      12      10 
The WeightLoss dataset contains pounds lost and self-esteem measurements over three weeks for three different groups: a control group, one group just on a diet, and one group that dieted and exercised. We will be testing the hypothesis that the means of the weight loss at week 2 are not all equal:
H0 = the mean weight loss at week 2 between the control group, diet group, and diet and exercise group are equal
H1 = at least two of the means of weight loss at week 2 between the control group, diet group, and diet and exercise group are not equal
Before the test, let's check out a boxplot of the means:
> qplot(group, wl2, data=WeightLoss, geom="boxplot", fill=group)
Figure 6.8: Boxplot of weight lost in week 2 of the trial for three groups: control, diet, and diet & exercise
Now for the ANOVA…
> the.anova <- aov(wl2 ~ group, data=WeightLoss)
> summary(the.anova)
            Df Sum Sq Mean Sq F value   Pr(>F)    
group        2  45.28  22.641   13.37 6.49e-05 ***
Residuals   31  52.48   1.693                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Oh, snap! The p-value (Pr(>F)) is 6.49e-05, which is .000065 if you haven't read scientific notation yet.
As I said before, this just means that at least one of the comparisons between means was significant—there are four ways that this could occur:
The means of diet and diet and exercise are different
The means of diet and control are different
The means of control and diet and exercise are different
The means of control, diet, and diet and exercise are all different
In order to investigate further, we perform a post-hoc test. Quite often, the post-hoc test that analysts perform is a suite of t-tests comparing each pair of means (pairwise t-tests). But wait, didn't I say that was dangerous business? I did, but it's different now:
We have already performed an honest-to-goodness omnibus test at the alpha level of our choosing. Only after we achieve significance do we perform pairwise t-tests.
We correct for the problem of multiple comparisons
The easiest multiple comparison correcting procedure to understand is Bonferroni correction. In its simplest version, it simply changes the alpha value by dividing it by the number of tests being performed. It is considered the most conservative of all the multiple comparison correction methods. In fact, many consider it too conservative and I'm inclined to agree. Instead, I suggest using a correcting procedure called Holm-Bonferroni correction. R uses this by default.
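An equivalent way of thinking about these corrections is to adjust the p-values upward rather than the alpha level downward, which is what R's p.adjust function (used internally by pairwise.t.test) does. A small sketch with three made-up raw p-values:

```r
# three hypothetical raw p-values from three pairwise tests
raw.p <- c(0.001, 0.012, 0.034)

# Bonferroni multiplies every p-value by the number of tests
p.adjust(raw.p, method="bonferroni")   # 0.003 0.036 0.102

# Holm-Bonferroni multiplies the smallest by 3, the next by 2,
# and the largest by 1, which is less conservative
p.adjust(raw.p, method="holm")         # 0.003 0.024 0.034
```

Notice that the Bonferroni-adjusted values are never smaller than the Holm-adjusted ones, which is why Bonferroni costs more power.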
> pairwise.t.test(WeightLoss$wl2, as.vector(WeightLoss$group))

	Pairwise comparisons using t tests with pooled SD 

data:  WeightLoss$wl2 and as.vector(WeightLoss$group) 

       Control Diet   
Diet   0.28059 -      
DietEx 7.1e-05 0.00091

P value adjustment method: holm 
This output indicates that the difference in means between the Diet and Diet and exercise groups has p < .001. Additionally, it indicates that the difference between Diet and exercise and Control has p < .0001 (look at the cell where it says 7.1e-05). The p-value of the comparison of just diet and the control is .28, so we fail to reject the hypothesis that they have the same mean.
Assumptions of ANOVA
The standard one-way ANOVA makes three main assumptions:
The observations are independent
The distribution of the residuals (the distances between the values within the groups and their respective means) is approximately normal
Homogeneity of variance: If you suspect that this assumption is violated, you can use R's oneway.test instead
Testing independence of proportions
Remember the University of California Berkeley dataset that we first saw when discussing the relationship between two categorical variables in Chapter 3, Describing Relationships? Recall that UCB was sued because it appeared as though the admissions department showed preferential treatment to male applicants. Also recall that we used cross-tabulation to compare the proportion of admissions across categories.
If admission rates were, say, 10%, you would expect about one out of every ten applicants to be accepted regardless of gender. If this is the case—if gender has no bearing on the proportion of admits—then admission is independent of gender.
Small deviations from this 10% proportion are, of course, to be expected in the real world and not necessarily indicative of a sexist admissions machine. However, if a test of independence of proportions is significant, it indicates that a deviation as extreme as the one we observed would be very unlikely to occur if the variables were truly independent.
A test statistic that captures divergence from an idealized, perfectly independent cross-tabulation is the chi-squared statistic (the χ2 statistic), and its sampling distribution is known as a chi-square distribution. If our chi-square statistic falls into the critical region of the chi-square distribution with the appropriate degrees of freedom, then we reject the hypothesis that gender is an independent factor in admissions.
Let’sperformoneofthesechi-squaretestsonthewholeUCBAdmissionsdataset.
>#Thechi-squaretestfunctiontakesacross-tabulation
>#whichUCBAdmissionsalreadyis.Iamconvertingitfrom
>#andbacksothatyou,dearreader,canlearnhowtodo
>#thiswithotherdatathatisn'talreadyincross-tabulation
>#form
>ucba<-as.data.frame(UCBAdmissions)
>head(ucba)
AdmitGenderDeptFreq
1AdmittedMaleA512
2RejectedMaleA313
3AdmittedFemaleA89
4RejectedFemaleA19
5AdmittedMaleB353
6RejectedMaleB207
>
>#createcross-tabulation
>cross.tab<-xtabs(Freq~Gender+Admit,data=ucba)
>
>chisq.test(cross.tab)
Pearson'sChi-squaredtestwithYates'continuitycorrection
data:cross.tab
X-squared=91.6096,df=1,p-value<2.2e-16
The proportions are almost certainly not independent (p < .0001). Before you conclude that the admissions department is sexist, remember Simpson's Paradox? If you don't, reread the relevant section in Chapter 3, Describing Relationships.
Since the chi-square independence of proportions test can be (and often is) used to compare a whole mess of proportions, it's sometimes referred to as an omnibus test, just like the ANOVA. It doesn't tell us which proportions are significantly discrepant, only that some proportions are.
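To see what the chi-squared statistic measures divergence from, you can compute the expected cell counts under independence yourself—each cell is (row total × column total) / grand total—and compare them with the table chisq.test builds internally:

```r
# rebuild the cross-tabulation from the text
ucba <- as.data.frame(UCBAdmissions)
cross.tab <- xtabs(Freq ~ Gender + Admit, data=ucba)

# expected counts under independence:
# (row total * column total) / grand total for every cell
by.hand <- outer(rowSums(cross.tab), colSums(cross.tab)) / sum(cross.tab)
by.hand

# chisq.test computes the same table internally
chisq.test(cross.tab)$expected
```

The chi-squared statistic sums the squared differences between the observed and these expected counts (scaled by the expected counts), so big gaps between the two tables translate into a big statistic.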
What if my assumptions are unfounded?
The t-test and ANOVA are both considered parametric statistical tests. The word parametric is used in different contexts to signal different things but, essentially, it means that these tests make certain assumptions about the parameters of the population distributions from which the samples are drawn. When these assumptions are met (with varying degrees of tolerance to violation), the inferences are accurate, powerful (in the statistical sense), and are usually quick to calculate. When those parametric assumptions are violated, though, parametric tests can often lead to inaccurate results.
We’vespokenabouttwomainassumptionsinthischapter:normalityandhomogeneityof
variance.Imentionedthat,eventhoughyoucantestforhomogeneityofvariancewiththe
leveneTestfunctionfromthecarpackage,thedefaultt.testinRremovesthis
restriction.Ialsomentionedthatyoucouldusetheoneway.testfunctioninlieuofaovif
youdon’thavetohavetoadheretothisassumptionwhenperforminganANOVA.Dueto
theseaffordances,I’lljustfocusontheassumptionofnormalityfromnowon.
In a t-test, the assumption that the sample is drawn from an approximately normal distribution can be visually verified, to a certain extent. The naïve way is to simply make a histogram of the data. A more proper approach is to use a QQ-plot (quantile-quantile plot). You can view a QQ-plot in R by using the qqPlot function from the car package. Let's use it to evaluate the normality of the miles per gallon vector in mtcars.
> library(car)
> qqPlot(mtcars$mpg)
Figure 6.9: A QQ-plot of the miles per gallon vector in mtcars
A QQ-plot can actually be used to compare a sample against any theoretical distribution, but it is most often associated with the normal distribution. The plot depicts the quantiles of the sample and the quantiles of the normal distribution against each other. If the sample were perfectly normal, the points would fall on the solid red diagonal line—divergence from this line signals a divergence from normality. Even though it is clear that the quantiles for mpg don't precisely comport with the quantiles of the normal distribution, the divergence is relatively minor.
The most powerful method for evaluating adherence to the assumption of normality is to use a statistical test. We are going to use the Shapiro-Wilk test, because it's my favorite, though there are a few others.
> shapiro.test(mtcars$mpg)

	Shapiro-Wilk normality test

data:  mtcars$mpg
W = 0.9476, p-value = 0.1229

This non-significant result indicates that the deviations from normality are not statistically significant.
For ANOVAs, the assumption of normality applies to the residuals, not the actual values of the data. After performing the ANOVA, we can check the normality of the residuals quite easily:
> # I'm repeating the set-up
> library(car)
> the.anova <- aov(wl2 ~ group, data=WeightLoss)
>
> shapiro.test(the.anova$residuals)

	Shapiro-Wilk normality test

data:  the.anova$residuals
W = 0.9694, p-value = 0.4444

We're in the clear!
But what if we do violate our parametric assumptions!? In cases like these, many analysts will fall back on using non-parametric tests.
Many statistical tests, including the t-test and ANOVA, have non-parametric alternatives. The appeal of these tests is, of course, that they are resistant to violations of parametric assumptions—that they are robust. The drawback is that these tests are usually less powerful than their parametric counterparts. In other words, they have a somewhat diminished capacity for detecting an effect if there truly is one to detect. For this reason, if you are going to use NHST, you should use the more powerful tests by default, and switch only if your assumptions are violated.
The non-parametric alternative to the independent t-test is called the Mann-Whitney U test, though it is also known as the Wilcoxon rank-sum test. As you might expect by now, there is a function to perform this test in R. Let's use it on the auto vs. manual transmission example:
> wilcox.test(automatic.mpgs, manual.mpgs)

	Wilcoxon rank sum test with continuity correction

data:  automatic.mpgs and manual.mpgs
W = 42, p-value = 0.001871
alternative hypothesis: true location shift is not equal to 0

Simple!
The non-parametric alternative to the one-way ANOVA is called the Kruskal-Wallis test. Can you see where I'm going with this?
> kruskal.test(wl2 ~ group, data=WeightLoss)

	Kruskal-Wallis rank sum test

data:  wl2 by group
Kruskal-Wallis chi-squared = 14.7474, df = 2, p-value = 0.0006275

Super!
Exercises
Here are a few exercises for you to practise and revise the concepts learned in this chapter:
Read about data-dredging and p-hacking. Why is it dangerous not to formulate a hypothesis, set an alpha level, and set a sample size before collecting data and analyzing results?
Use the command library(help="datasets") to find a list of datasets that R has already built in. Pick a few interesting ones, and form a hypothesis about each one. Rigorously define your null and alternative hypotheses before you start. Test those hypotheses even if it means learning about other statistical tests.
How might you quantify the effect size of a one-way ANOVA? Look up eta-squared if you get stuck.
In ethics, the doctrine of moral relativism holds that there are no universal moral truths, and that moral judgments are dependent upon one's culture or period in history. How can moral progress (the abolition of slavery, fairer trading practices) be reconciled with a relativistic view of morality? If there is no objective moral paradigm, how can criticisms be lodged against the current views of morality? Why replace existing moral judgments with others if there is no standard to compare them to and, therefore, no reason to prefer one over the other?
Summary
We covered huge ground in this chapter. By now, you should be up to speed on some of the most common statistical tests. More importantly, you should have a solid grasp of the theory behind NHST and why it works. This knowledge is far more valuable than mechanically memorizing a list of statistical tests and clues for when to use each.
You learned that NHST has its origin in testing whether a weird lady's claims about tasting tea were true or not. The general procedure for NHST is to define your null and alternative hypotheses, define and calculate your test statistic, determine the shape and parameters of the sampling distribution of that test statistic, measure the probability that you would observe a test statistic as or more extreme than the one you observed (this is the p-value), and determine whether to reject or fail to reject the null hypothesis based on whether the p-value was below or above the alpha level.
You then learned about one vs. two-tailed tests, Type I and Type II errors, and got some warnings about terminology and common NHST misconceptions.
Then, you learned a litany of statistical tests—we saw that the one sample t-test is used in scenarios where we want to determine if a sample's mean is significantly discrepant from some known population mean; we saw that independent samples t-tests are used to compare the means of two distinct samples against each other; we saw that we use one-way ANOVAs for testing multiple means, why it's inappropriate to just perform a bunch of t-tests, and some methods of controlling Type I error rate inflation. Finally, you learned how the chi-square test is used to check the independence of proportions.
We then directly applied what you learned to real, fun data and tested real, fun hypotheses. They were fun… right!?
Lastly, we discussed parametric assumptions, how to verify that they were met, and one option for circumventing their violation at the cost of power: non-parametric tests. We learned that the non-parametric alternative to the independent samples t-test is available in R as wilcox.test, and the non-parametric alternative to the one-way ANOVA is available in R using the kruskal.test function.
In the next chapter, we will also be discussing mechanisms for testing hypotheses, but this time, we will be using an attractive alternative to NHST based on the famous theorem by Reverend Thomas Bayes that you learned about in Chapter 4, Probability. You'll see how this other method of inference addresses some of the shortcomings (deserved or not) of NHST, and why it's gaining popularity in modern applied data analysis. See you there!
Chapter 7. Bayesian Methods
Suppose I claim that I have a pair of magic rainbow socks. I allege that whenever I wear these special socks, I gain the ability to predict the outcome of coin tosses, using fair coins, better than chance would dictate. Putting my claim to the test, you toss a coin 30 times, and I correctly predict the outcome 20 times. Using a directional hypothesis with the binomial test, the null hypothesis would be rejected at alpha-level 0.05. Would you invest in my special socks?
Why not? If it's because you require a larger burden of proof on absurd claims, I don't blame you. As a grandparent of Bayesian analysis, Pierre-Simon Laplace (who independently discovered the theorem that bears Thomas Bayes' name) once said: The weight of evidence for an extraordinary claim must be proportioned to its strangeness. Our prior belief—my absurd hypothesis—is so small that it would take much stronger evidence to convince the skeptical investor, let alone the scientific community.
Unfortunately, if you'd like to easily incorporate your prior beliefs into NHST, you're out of luck. Or suppose you need to assess the probability of the null hypothesis; you're out of luck there, too; NHST assumes the null hypothesis and can't make claims about the probability that a particular hypothesis is true. In cases like these (and in general), you may want to use Bayesian methods instead of frequentist methods. This chapter will tell you how. Join me!
The big idea behind Bayesian analysis
If you recall from Chapter 4, Probability, the Bayesian interpretation of probability views probability as our degree of belief in a claim or hypothesis, and Bayesian inference tells us how to update that belief in the light of new evidence. In that chapter, we used Bayesian inference to determine the probability that employees of Daisy Girl, Inc. were using an illegal drug. We saw how the incorporation of prior beliefs saved two employees from being falsely accused and helped another employee get the help she needed even though her drug screen was falsely negative.
In a general sense, Bayesian methods tell us how to dole out credibility to different hypotheses, given prior belief in those hypotheses and new evidence. In the drug example, the hypothesis suite was discrete: drug user or not drug user. More commonly, though, when we perform Bayesian analysis, our hypothesis concerns a continuous parameter, or many parameters. Our posterior (or updated beliefs) was also discrete in the drug example, but Bayesian analysis usually yields a continuous posterior called a posterior distribution.
We are going to use Bayesian analysis to put my magical rainbow socks claim to the test. Our parameter of interest is the proportion of coin tosses that I can correctly predict wearing the socks; we'll call this parameter θ, or theta. Our goal is to determine what the most likely values of theta are and whether they constitute proof of my claim.
Refer back to the section on Bayes' theorem in Chapter 4, Probability. Recall that the posterior was the prior times the likelihood divided by a normalizing constant. This normalizing constant is often difficult to compute. Luckily, since it doesn't change the shape of the posterior distribution, and we are comparing relative likelihoods and probability densities, Bayesian methods often ignore this constant. So, all we need is a probability density function to describe our prior belief and a likelihood function that describes the likelihood that we would get the evidence we received given different parameter values.
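To see this concretely, here is a sketch of mine (not from the book) that approximates the posterior on a grid: evaluate prior times likelihood at many candidate values of theta, then normalize numerically at the end. The troublesome constant is never computed.

```r
# Grid approximation of the posterior for 20 successes in 30 trials
theta <- seq(0, 1, by = 0.001)            # candidate parameter values
prior <- dbeta(theta, 1, 1)               # a flat prior
likelihood <- dbinom(20, size = 30, prob = theta)
unnormalized <- prior * likelihood        # this already has the right shape
posterior <- unnormalized / sum(unnormalized)   # normalize numerically
theta[which.max(posterior)]               # the most credible value of theta
```

With the flat prior, the grid's most credible value sits at the proportion of observed successes, about 0.667, and its mean works out to sum(theta * posterior), about 0.656.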
The likelihood function is a binomial function, as it describes the behavior of Bernoulli trials; the binomial likelihood function for this evidence is shown in Figure 7.1:
Figure 7.1: The likelihood function of theta for 20 out of 30 successful Bernoulli trials
For different values of theta, there are varying relative likelihoods. Note that the value of theta that corresponds to the maximum of the likelihood function is 0.667, which is the proportion of successful Bernoulli trials. This means that in the absence of any other information, the most likely proportion of coin flips that my magic socks allow me to predict is 67%. This is called the Maximum Likelihood Estimate (MLE).
So, we have the likelihood function; now we just need to choose a prior. We will be crafting a representation of our prior beliefs using a type of distribution called a beta distribution, for reasons that we'll see very soon.
Since our posterior is a blend of the prior and likelihood function, it is common for analysts to use a prior that doesn't much influence the results and allows the likelihood function to speak for itself. To this end, one may choose to use a non-informative prior that assigns equal credibility to all values of theta. This type of non-informative prior is called a flat or uniform prior.
The beta distribution has two hyper-parameters, α (or alpha) and β (or beta). A beta distribution with hyper-parameters α = β = 1 describes such a flat prior. We will call this prior #1.
Note
These are usually referred to as the beta distribution's parameters. We call them hyper-parameters here to distinguish them from our parameter of interest, theta.
Figure 7.2: A flat prior on the value of theta. This beta distribution, with alpha and beta = 1, confers an equal level of credibility to all possible values of theta, our parameter of interest.
This prior isn't really indicative of our beliefs, is it? Do we really assign as much probability to my socks giving me perfect coin-flip prediction powers as we do to the hypothesis that I'm full of baloney?
The prior that a skeptic might choose in this situation is one that looks more like the one depicted in Figure 7.3, a beta distribution with hyper-parameters alpha = beta = 50. This, rather appropriately, assigns far more credibility to values of theta that are concordant with a universe without magical rainbow socks. As good scientists, though, we have to be open-minded to new possibilities, so this doesn't rule out the possibility that the socks give me special powers—the probability is low, but not zero, for extreme values of theta. We will call this prior #2.
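If you want to draw these two priors yourself, a quick sketch (mine, not from the book) using R's dbeta:

```r
# Prior #2, the skeptic's beta(50, 50), peaked tightly around theta = 0.5
curve(dbeta(x, 50, 50), xlab = "θ", ylab = "prior belief",
      type = "l", yaxt = 'n')
# Prior #1, the flat beta(1, 1), drawn dashed for comparison
curve(dbeta(x, 1, 1), lty = 2, add = TRUE)
```

The flat prior is a horizontal line at density 1; the skeptic's prior piles nearly all of its credibility near theta = 0.5.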
Figure 7.3: A skeptic's prior
Before we perform the Bayesian update, I need to explain why I chose to use the beta distribution to describe my priors.
The Bayesian update—getting to the posterior—is performed by multiplying the prior with the likelihood. In the vast majority of applications of Bayesian analysis, we don't know what that posterior looks like, so we have to sample from it many times to get a sense of its shape. We will be doing this later in this chapter.
For cases like this, though, where the likelihood is a binomial function, using a beta distribution for our prior guarantees that our posterior will also be in the beta distribution family. This is because the beta distribution is a conjugate prior with respect to a binomial likelihood function. There are many other cases of distributions being self-conjugate with respect to certain likelihood functions, but it doesn't often happen in practice that we find ourselves in a position to use them as easily as we can for this problem. The beta distribution also has the nice property that it is naturally confined from 0 to 1, just like the proportion of coin flips I can correctly predict.
The hyper-parameters of the posterior distribution are:
alpha(posterior) = alpha(prior) + number of successes
beta(posterior) = beta(prior) + number of failures
That means the posterior distribution using prior #1 will have hyper-parameters alpha = 1 + 20 and beta = 1 + 10. This is shown in Figure 7.4.
Figure 7.4: The result of the Bayesian update of the evidence and prior #1. The interval depicts the 95% credible interval (the densest 95% of the area under the posterior distribution). This interval overlaps slightly with theta = 0.5.
A common way of summarizing the posterior distribution is with a credible interval. The credible interval on the plot in Figure 7.4 is the 95% credible interval and contains 95% of the densest area under the curve of the posterior distribution.
Do not confuse this with a confidence interval. Though it may look like it, this credible interval is very different from a confidence interval. Since the posterior directly contains information about the probability of our parameter of interest at different values, it is admissible to claim that there is a 95% chance that the correct parameter value is in the credible interval. We could make no such claim with confidence intervals. Please do not mix up the two meanings, or people will laugh you out of town.
Observe that the 95% most likely values for theta contain the theta value 0.5, if only barely. Due to this, one may wish to say that the evidence does not rule out the possibility that I'm full of baloney regarding my magical rainbow socks, but the evidence was suggestive.
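The interval in Figure 7.4 is the densest 95% region. An equal-tailed interval, which is not identical but close for a posterior this nearly symmetric, can be read straight off the quantile function; this shortcut is mine, not the book's:

```r
# Equal-tailed 95% interval for the beta(21, 11) posterior
bounds <- qbeta(c(.025, .975), 21, 11)
bounds   # the lower bound lands near theta = 0.5, echoing "if only barely"
```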
To be clear, the end result of our Bayesian analysis is the posterior distribution depicting the credibility of different values of our parameter. The decision to interpret this as sufficient or insufficient evidence for my outlandish claim is a decision that is separate from the Bayesian analysis proper. In contrast to NHST, the information we glean from Bayesian methods—the entire posterior distribution—is much richer. Another thing that makes Bayesian methods great is that you can make intuitive claims about the probability of hypotheses and parameter values in a way that frequentist NHST does not allow you to do.
What does that posterior using prior #2 look like? It's a beta distribution with alpha = 50 + 20 and beta = 50 + 10:
> curve(dbeta(x, 70, 60),          # plot a beta distribution
+       xlab="θ",                  # name x-axis
+       ylab="posterior belief",   # name y-axis
+       type="l",                  # make smooth line
+       yaxt='n')                  # remove y axis labels
> abline(v=.5, lty=2)              # make line at theta = 0.5
Figure 7.5: Posterior distribution of theta using prior #2
Choosing a prior
Notice that the posterior distribution looks a little different depending on what prior you use. The most common criticism lodged against Bayesian methods is that the choice of prior adds an unsavory subjective element to analysis. To a certain extent, they're right about the added subjective element, but their allegation that it is unsavory is way off the mark.
To see why, check out Figure 7.6, which shows both posterior distributions (from priors #1 and #2) in the same plot. Notice how priors #1 and #2—two very different priors—given the evidence, produce posteriors that look more similar to each other than the priors did.
Figure 7.6: The posterior distributions from priors #1 and #2
Now direct your attention to Figure 7.7, which shows the posteriors from both priors if the evidence included 80 out of 120 correct trials.
Figure 7.7: The posterior distributions from priors #1 and #2 with more evidence
Note that the evidence still contains 67% correct trials, but there is now more evidence. The posterior distributions are now far more similar. Notice that now both of the posteriors' credible intervals do not contain theta = 0.5; with 80 out of 120 trials correctly predicted, even the most obstinate skeptic has to concede that something is going on (though they will probably disagree that the power comes from the socks!).
Take notice also of the fact that the credible intervals, in both posteriors, are now substantially narrower, illustrating more confidence in our estimate.
Finally, imagine the case where I correctly predicted 67% of the trials, but out of 450 total trials. The posteriors derived from this evidence are shown in Figure 7.8:
Figure 7.8: The posterior distributions from priors #1 and #2 with even more evidence
The posterior distributions are looking very similar—indeed, they are becoming identical. Given enough trials—given enough evidence—these posterior distributions will be exactly the same. When there is enough evidence that the evidence, rather than the prior, dominates the posterior, we say the evidence has overwhelmed the prior.
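You can watch the prior being overwhelmed with a few lines of arithmetic (my sketch, holding the success rate at two-thirds throughout): the posterior means from prior #1 and prior #2 converge as the trial count grows.

```r
# Posterior means under prior #1 (beta(1,1)) and prior #2 (beta(50,50))
# as evidence accumulates at a constant 2/3 success rate
post.mean <- function(a, b, succ, fail) (a + succ) / (a + b + succ + fail)
for (n in c(30, 120, 450)) {
  succ <- round(n * 2/3)
  fail <- n - succ
  cat(n, "trials:", round(post.mean(1, 1, succ, fail), 3),
      "vs", round(post.mean(50, 50, succ, fail), 3), "\n")
}
# the two columns close in on each other (and on 2/3) as n grows
```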
Thereisnothingunsavoryormisleadingaboutananalysisthatusesasubjectiveprior;the
analystjusthastodisclosewhatherprioris.Youcan’tjustpickapriorwilly-nilly;ithas
tobejustifiabletoyouraudience.Inmostsituations,apriormaybeinformedbyprior
evidencelikescientificstudiesandcanbesomethingthatmostpeoplecanagreeon.A
moreskepticalaudiencemaydisagreewiththechosenprior,inwhichcasetheanalysis
canbere-runusingtheirprior,justlikewedidinthemagicsocksexample.Itissometimes
okayforpeopletohavedifferentpriorbeliefs,anditisokayforsomepeopletorequirea
littlemoreevidenceinordertobeconvincedofsomething.
Thebeliefthatfrequentisthypothesistestingismoreobjective,andthereforemorecorrect,
ismistakeninsofarasitcausesallpartiestohaveaholdonthesamepotentiallybad
assumptions.TheassumptionsinBayesiananalysis,ontheotherhand,arestatedclearly
fromthestart,madepublic,andareauditable.
Torecap,therearethreesituationsyoucancomeacross.Inallofthese,itmakessenseto
useBayesianmethods,ifthat’syourthing:
Youhavealotofevidence,anditmakesnorealdifferencewhichpriorany
reasonablepersonuses,becausetheevidencewilloverwhelmit.
Youhaveverylittleevidence,buthavetomakeanimportantdecisiongiventhe
evidence.Inthiscase,you’dbefoolishtonotuseallavailableinformationtoinform
yourdecisions.
Youhaveamediumamountofevidence,anddifferentposteriorsillustratethe
updatedbeliefsfromadiversearrayofpriorbeliefs.Youmayrequiremoreevidence
toconvincetheextremelyskeptical,butthemajorityofinterestedpartieswillbe
cometothesameconclusions.
Who cares about coin flips
Who cares about coin flips? Well, virtually no one. However, (a) coin flips are a great simple application to get the hang of Bayesian analysis; (b) the kinds of problems that a beta prior and a binomial likelihood function solve go way beyond assessing the fairness of coin flips. We are now going to apply the same technique to a real life problem that I actually came across in my work.
For my job, I had to create a career recommendation system that asked the user a few questions about their preferences and spat out some careers they may be interested in. After a few hours, I had a working prototype. In order to justify putting more resources into improving the project, I had to prove that I was on to something and that my current recommendations performed better than chance.
In order to test this, we got 40 people together, asked them the questions, and presented them with two sets of recommendations. One was the true set of recommendations that I came up with, and one was a control set—the recommendations of a person who answered the questions randomly. If my set of recommendations performed better than chance would dictate, then I had a good thing going, and could justify spending more time on the project.
Simply performing better than chance is no great feat on its own—I also wanted really good estimates of how much better than chance my initial recommendations were.
For this problem, I broke out my Bayesian toolbox! The parameter of interest is the proportion of the time my recommendations performed better than chance. If .5 and lower were very unlikely values of the parameter, as far as the posterior depicted, then I could conclude that I was on to something.
Even though I had strong suspicions that my recommendations were good, I used a uniform beta prior to preemptively thwart criticisms that my prior biased the conclusions. As for the likelihood function, it is the same function family we used for the coin flips (just with different parameters).
It turns out that 36 out of the 40 people preferred my recommendations to the random ones (three liked them both the same, and one weirdo liked the random ones better). The posterior distribution, therefore, was a beta distribution with parameters 37 and 5.
> curve(dbeta(x, 37, 5), xlab="θ",
+       ylab="posterior belief",
+       type="l", yaxt='n')
Figure 7.9: The posterior distribution of the effectiveness of my recommendations using a uniform prior
Again, the end result of the Bayesian analysis proper is the posterior distribution that illustrates credible values of the parameter. The decision to set an arbitrary threshold for concluding that my recommendations were effective or not is a separate matter.
Let's say that, before the fact, we stated that if .5 or lower were not among the 95% most credible values, we would conclude that my recommendations were effective. How do we know what the credible interval bounds are?
Even though it is relatively straightforward to determine the bounds of the credible interval analytically, doing so ourselves computationally will help us understand how the posterior distribution is summarized in the examples given later in this chapter.
To find the bounds, we will sample from a beta distribution with hyper-parameters 37 and 5 thousands of times and find the quantiles at .025 and .975.
> samp <- rbeta(10000, 37, 5)
> quantile(samp, c(.025, .975))
     2.5%     97.5%
0.7674591 0.9597010
Neat! With the previous plot already up, we can add lines to the plot indicating this 95% credible interval, like so:
> # horizontal line
> lines(c(.767, .96), c(0.1, 0.1))
> # tiny vertical left boundary
> lines(c(.767, .767), c(0.15, 0.05))
> # tiny vertical right boundary
> lines(c(.96, .96), c(0.15, 0.05))
If you plot this yourself, you'll see that even the lower bound is far from the decision boundary—it looks like my work was worth it after all!
The technique of sampling from a distribution many, many times to obtain numerical results is known as Monte Carlo simulation.
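Monte Carlo estimates can be checked against exact answers when we have them. Here is a quick check of mine using the beta(37, 5) posterior from above:

```r
# Monte Carlo estimate of P(theta > 0.8) versus the exact CDF answer
set.seed(1)
samp <- rbeta(100000, 37, 5)     # many draws from the posterior
mean(samp > 0.8)                 # Monte Carlo estimate of the probability
1 - pbeta(0.8, 37, 5)            # exact value, for comparison
```

With this many samples the two agree to a couple of decimal places, and the estimate's error keeps shrinking as the number of draws grows.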
Enter MCMC – stage left
As mentioned earlier, we started with the coin flip examples because of the ease of determining the posterior distribution analytically—primarily because of the beta distribution's self-conjugacy with respect to the binomial likelihood function.
It turns out that most real-world Bayesian analyses require a more complicated solution. In particular, the hyper-parameters that define the posterior distribution are rarely known. What can be determined is the probability density in the posterior distribution for each parameter value. The easiest way to get a sense of the shape of the posterior is to sample from it many thousands of times. More specifically, we sample from all possible parameter values and record the probability density at that point.
How do we do this? Well, in the case of just one parameter value, it's often computationally tractable to just randomly sample willy-nilly from the space of all possible parameter values. For cases where we are using Bayesian analysis to determine the credible values for two parameters, things get a little more hairy.
The posterior distribution for more than one parameter value is called a joint distribution; in the case of two parameters, it is, more specifically, a bivariate distribution. One such bivariate distribution can be seen in Figure 7.10:
Figure 7.10: A bivariate normal distribution
To picture what it is like to sample a bivariate posterior, imagine placing a bell jar on top of a piece of graph paper (be careful to make sure Esther Greenwood isn't under there!). We don't know the shape of the bell jar but we can, for each intersection of the lines in the graph paper, find the height of the bell jar over that exact point. Clearly, the smaller the grid on the graph paper, the higher resolution our estimate of the posterior distribution is.
Note that in the univariate case, we were sampling from n points; in the bivariate case, we are sampling from n² points (n points for each axis). For models with more than two parameters, it is simply intractable to use this random sampling method. Luckily, there's a better option than just randomly sampling the parameter space: Markov Chain Monte Carlo (MCMC).
I think the easiest way to get a sense of what MCMC is, is by likening it to the game hot and cold. In this game—which you may have played as a child—an object is hidden and a searcher is blindfolded and tasked with finding this object. As the searcher wanders around, the other player tells the searcher whether she is hot or cold; hot if she is near the object, cold when she is far from the object. The other player also indicates whether the movement of the searcher is getting her closer to the object (getting warmer) or further from the object (getting cooler).
In this analogy, warm regions are areas where the probability density of the posterior distribution is high, and cool regions are the areas where the density is low. Put in this way, random sampling is like the searcher teleporting to random places in the space where the other player hid the object and just recording how hot or cold it is at that point. The guided behavior of the player we described before is far more efficient at exploring the areas of interest in the space.
At any one point, the blindfolded searcher has no memory of where she has been before. Her next position only depends on the point she is at currently (and the feedback of the other player). A memory-less transition process whereby the next position depends only upon the current position, and not on any previous positions, is called a Markov chain. The technique for determining the shape of high-dimensional posterior distributions is therefore called Markov chain Monte Carlo, because it uses Markov chains to intelligently sample many times from the posterior distribution (Monte Carlo simulation).
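A toy MCMC sampler makes the hot-and-cold analogy concrete. The sketch below is mine, and it uses the Metropolis algorithm rather than the Gibbs sampling that JAGS (introduced next) performs: from the current position, propose a small random step, always accept steps toward warmer (denser) regions, and accept steps toward cooler regions only occasionally. Its samples settle into the beta(37, 5) posterior from the recommendations example:

```r
# A minimal Metropolis sampler targeting the beta(37, 5) posterior.
# Only a density ratio is needed, so the normalizing constant cancels.
set.seed(2)
n.samples <- 20000
chain <- numeric(n.samples)
current <- 0.5                              # arbitrary starting point
for (i in 1:n.samples) {
  proposal <- current + rnorm(1, 0, 0.05)   # small random step
  ratio <- dbeta(proposal, 37, 5) / dbeta(current, 37, 5)
  if (runif(1) < ratio)       # warmer: always move; cooler: sometimes move
    current <- proposal
  chain[i] <- current         # record the position (repeats count, too)
}
mean(chain)                   # near the true posterior mean, 37/42
```

In practice the first stretch of the chain (the burn-in, before it finds the warm region) is discarded; the software we use next handles that bookkeeping for us.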
The development of software to perform MCMC on commodity hardware is, for the most part, responsible for a Bayesian renaissance in recent decades. Problems that were, not too long ago, completely intractable can now be solved on even relatively low-powered computers.
There is far more to know about MCMC than we have the space to discuss here. Luckily, we will be using software that abstracts some of these deeper topics away from us. Nevertheless, if you decide to use Bayesian methods in your own analyses (and I hope you do!), I'd strongly recommend consulting resources that can afford to discuss MCMC at a deeper level. There are many such resources, available for free, on the web.
Before we move on to examples using this method, it is important that we bring up this one last point: mathematically, an infinitely long MCMC chain will give us a perfect picture of the posterior distribution. Unfortunately, we don't have all the time in the world (universe[?]), and we have to settle for a finite number of MCMC samples. The longer our chains, the more accurate the description of the posterior. As the chains get longer and longer, each new sample provides a smaller and smaller amount of new information (economists call this diminishing marginal returns). There is a point in the MCMC sampling where the description of the posterior becomes sufficiently stable, and for all practical purposes, further sampling is unnecessary. It is at this point that we say the chain converged. Unfortunately, there is no perfect guarantee that our chain has achieved convergence. Of all the criticisms of using Bayesian methods, this is the most legitimate—but only slightly.
There are really effective heuristics for determining whether a running chain has converged, and we will be using a function that will automatically stop sampling the posterior once it has achieved convergence. Further, convergence can be all but perfectly verified by visual inspection, as we'll see soon.
For the simple models in this chapter, none of this will be a problem, anyway.
Using JAGS and runjags
Although it's a bit silly to break out MCMC for the single-parameter career recommendation analysis that we discussed earlier, applying this method to this simple example will aid in its usage for more complicated models.
In order to get started, you need to install a software program called JAGS, which stands for Just Another Gibbs Sampler (a Gibbs sampler is a type of MCMC sampler). This program is independent of R, but we will be using R packages to communicate with it. After installing JAGS, you will need to install the R packages rjags, runjags, and modeest. As a reminder, you can install all three with this command:
> install.packages(c("rjags", "runjags", "modeest"))
To make sure everything is installed properly, load the runjags package, and run the function testjags(). My output looks something like this:
> library(runjags)
> testjags()
You are using R version 3.2.1 (2015-06-18) on a unix machine, with the RStudio GUI
The rjags package is installed
JAGS version 3.4.0 found successfully using the command '/usr/local/bin/jags'
The first step is to create the model that describes our problem. This model is written in an R-like syntax and stored in a string (character vector) that will get sent to JAGS to interpret. For this problem, we will store the model in a string variable called our.model, and the model looks like this:
our.model <- "
model {
    # likelihood function
    numSuccesses ~ dbinom(successProb, numTrials)

    # prior
    successProb ~ dbeta(1, 1)

    # parameter of interest
    theta <- numSuccesses / numTrials
}"
Note that the JAGS syntax allows for R-style comments, which I included for clarity.
In the first few lines of the model, we are specifying the likelihood function. As we know, the likelihood function can be described with a binomial distribution. The line:
numSuccesses ~ dbinom(successProb, numTrials)
says the variable numSuccesses is distributed according to the binomial function with hyper-parameters given by variables successProb and numTrials.
In the next relevant line, we are specifying our choice of the prior distribution. In keeping with our previous choice, this line reads, roughly: the successProb variable (referred to in the previous relevant line) is distributed in accordance with the beta distribution with hyper-parameters 1 and 1.
In the last line, we are specifying that the parameter we are really interested in is the proportion of successes (number of successes divided by the number of trials). We are calling that theta. Notice that we used the deterministic assignment operator (<-) instead of the distributed according to operator (~) to assign theta.
The next step is to define the successProb and numTrials variables for shipping to JAGS. We do this by stuffing these variables in an R list. We do this as follows:
our.data <- list(
    numTrials = 40,
    successProb = 36/40
)
Great! We are all set to run the MCMC.
> results <- autorun.jags(our.model,
+                         data=our.data,
+                         n.chains=3,
+                         monitor=c('theta'))
The function that runs the MCMC sampler and automatically stops at convergence is autorun.jags. The first argument is the string specifying the JAGS model. Next, we tell the function where to find the data that JAGS will need. After this, we specify that we want to run 3 independent MCMC chains; this will help guarantee convergence and, if we run them in parallel, drastically cut down on the time we have to wait for our sampling to be done. (To see some of the other options available, as always, you can run ?autorun.jags.) Lastly, we specify that we are interested in the variable 'theta'.
After this is done, we can directly plot the results variable where the results of the MCMC are stored. The output of this command is shown in Figure 7.11.
> plot(results,
+      plot.type=c("histogram", "trace"),
+      layout=c(2, 1))
Figure 7.11: Output plots from the MCMC results. The top is a trace plot of theta values along the chain's length. The bottom is a barplot depicting the relative credibility of different theta values.
The first of these plots is called a trace plot. It shows the sampled values of theta as the chain got longer. The fact that all three chains are overlapping around the same set of values is, at least in this case, a strong guarantee that all three chains have converged. The bottom plot is a barplot that depicts the relative credibility of different values of theta. It is shown here as a barplot, and not a smooth curve, because the binomial likelihood function is discrete. If we want a continuous representation of the posterior distribution, we can extract the sample values from the results and plot it as a density plot with a sufficiently large bandwidth:
> # mcmc samples are stored in mcmc attribute
> # of results variable
> results.matrix <- as.matrix(results$mcmc)
>
> # extract the samples for 'theta',
> # the only column, in this case
> theta.samples <- results.matrix[,'theta']
>
> plot(density(theta.samples, adjust=5))
And we can add the bounds of the 95% credible interval to the plot as before:
> quantile(theta.samples, c(.025, .975))
 2.5% 97.5%
0.800 0.975
> lines(c(.8, .975), c(0.1, 0.1))
> lines(c(.8, .8), c(0.15, 0.05))
> lines(c(.975, .975), c(0.15, 0.05))
Figure 7.12: Density plot of the posterior distribution. Note that the x-axis starts here at 0.6.
Rest assured that there is only a disagreement between the two credible intervals' bounds in this example because the MCMC could only sample discrete values from the posterior, since the likelihood function is discrete. This will not occur in the other examples in this chapter. Regardless, the two methods seem to be in agreement about the shape of the posterior distribution and the credible values of theta. It is all but certain that my recommendations are better than chance. Go me!
Fitting distributions the Bayesian way
In this next example, we are going to be fitting a normal distribution to the precipitation dataset that we worked with in the previous chapter. We will wrap up with a Bayesian analogue to the one sample t-test.
The results we want from this analysis are credible values of the true population mean of the precipitation data. Refer back to the previous chapter to recall that the sample mean was 34.89. In addition, we will also be determining credible values of the standard deviation of the precipitation data. Since we are interested in the credible values of two parameters, our posterior distribution is a joint distribution.
Our model will look a little different now:
the.model <- "
model {
    mu ~ dunif(0, 60)        # prior
    stddev ~ dunif(0, 30)    # prior
    tau <- pow(stddev, -2)
    for(i in 1:theLength){
        samp[i] ~ dnorm(mu, tau)    # likelihood function
    }
}"
This time, we have to set two priors, one for the mean of the Gaussian curve that describes the precipitation data (mu), and one for the standard deviation (stddev). We also have to create a variable called tau that describes the precision (inverse of the variance) of the curve, because dnorm in JAGS takes the mean and the precision as hyper-parameters (and not the mean and standard deviation, like R). We specify that our prior for the mu parameter is uniformly distributed from 0 inches of rain to 60 inches of rain—far above any reasonable value for the population precipitation mean. We also specify that our prior for the standard deviation is a flat one from 0 to 30. If this were part of any meaningful analysis and not just a pedagogical example, our priors would be informed in part by precipitation data from other regions or by precipitation data from previous years. JAGS comes chock full of different families of distributions for expressing different priors.
Next, we specify that the variable samp (which will hold the precipitation data) is distributed normally with unknown parameters mu and tau.
Then, we construct an R list to hold the variables to send to JAGS:
the.data <- list(
    samp = precip,
    theLength = length(precip)
)
Cool,let’srunit!Onmycomputer,thistakes5seconds.
>results<-autorun.jags(the.model,
+data=the.data,
+n.chains=3,
+#nowwecareabouttwoparameters
+monitor=c('mu','stddev'))
Let’splottheresultsdirectlylikebefore,whilebeingcarefultoplotboththetraceplotand
histogramfrombothparametersbyincreasingthelayoutargumentinthecalltotheplot
function.
>plot(results,
+plot.type=c("histogram","trace"),
+layout=c(2,2))
Figure7.13:OutputplotsfromtheMCMCresultoffittinganormalcurvetothebuilt-in
precipitationdataset
Figure 7.14 shows the distribution of credible values of the mu parameter without reference to the stddev parameter. This is called a marginal distribution.
Figure 7.14: Marginal distribution of the posterior for parameter 'mu'. The dashed line shows the hypothetical population mean within the 95% credible interval.
Remember when, in the last chapter, we wanted to determine whether the US mean precipitation was significantly discrepant from the (hypothetical) known population mean precipitation of the rest of the world of 38 inches? If we take any value outside the 95% credible interval to indicate significance, then, just like when we used the NHST t-test, we have to reject the hypothesis that there is significantly more or less rain in the US than in the rest of the world.
Before we move on to the next example, you may be interested in credible values for both the mean and the standard deviation at the same time. A great type of plot for depicting this information is a contour plot, which illustrates the shape of a three-dimensional surface by showing a series of lines for which there is equal height. In Figure 7.15, each line shows the edges of a slice of the posterior distribution that all have equal probability density.
>results.matrix<-as.matrix(results$mcmc)
>
>library(MASS)
>#weneedtomakeakerneldensity
>#estimateofthe3-dsurface
>z<-kde2d(results.matrix[,'mu'],
+results.matrix[,'stddev'],
+n=50)
>
>plot(results.matrix)
>contour(z,drawlabels=FALSE,
+nlevels=11,col=rainbow(11),
+lwd=3,add=TRUE)
Figure 7.15: Contour plot of the joint posterior distribution. The purple contour corresponds to the region with the highest probability density

The purple contours (the inner-most contours) show the region of the posterior with the highest probability density. These correspond to the most likely values of our two parameters. As you can see, the most likely values of the parameters for the normal distribution that best describes our present knowledge of US precipitation are a mean of a little less than 35 and a standard deviation of a little less than 14. We can corroborate the results of our visual inspection by directly printing the results variable:
> print(results)

JAGS model summary statistics from 30000 samples (chains = 3; adapt+burnin = 5000):

        Lower95  Median  Upper95    Mean      SD    Mode
mu       31.645  34.862   38.181  34.866  1.6639  34.895
stddev   11.669  13.886   16.376  13.967  1.2122  13.773

            MCerr  MC%ofSD  SSeff       AC.10    psrf
mu       0.012238      0.7  18484    0.002684  1.0001
stddev  0.0093951      0.8  16649  -0.0053588  1.0001

Total time taken: 5 seconds

which also shows other summary statistics from our MCMC samples and some information about the MCMC process.
The Bayesian independent samples t-test

For our last example in the chapter, we will be performing a sort-of Bayesian analogue to the two-sample t-test using the same data and problem from the corresponding example in the previous chapter: testing whether the means of the gas mileage for automatic and manual cars are significantly different.

Note

There is another popular Bayesian alternative to NHST, which uses something called Bayes factors to compare the likelihood of the null and alternative hypotheses.

As before, let's specify the model using non-informative flat priors:
the.model <- "
model {
  # each group will have a separate mu
  # and standard deviation
  for (j in 1:2) {
    mu[j] ~ dunif(0, 60)        # prior
    stddev[j] ~ dunif(0, 20)    # prior
    tau[j] <- pow(stddev[j], -2)
  }
  for (i in 1:theLength) {
    # likelihood function
    y[i] ~ dnorm(mu[x[i]], tau[x[i]])
  }
}"
Notice that the construct that describes the likelihood function is a little different now; we have to use nested subscripts for the mu and tau parameters to tell JAGS that we are dealing with two different versions of mu and stddev.

Next, the data:

the.data <- list(
  y = mtcars$mpg,
  # 'x' needs to start at 1, so
  # 1 is now manual and 2 is automatic
  x = ifelse(mtcars$am==1, 1, 2),
  theLength = nrow(mtcars)
)
Finally, let's roll!

> results <- autorun.jags(the.model,
+                         data=the.data,
+                         n.chains=3,
+                         monitor=c('mu', 'stddev'))

Let's extract the samples for both 'mu's and make a vector that holds the differences in the mu samples between each of the two groups.

> results.matrix <- as.matrix(results$mcmc)
> difference.in.means <- (results.matrix[,1] -
+                         results.matrix[,2])
Figure 7.16 shows a plot of the credible differences in means. The likely differences in means are far above a difference of zero. We are all but certain that the means of the gas mileage for automatic and manual cars are significantly different.

Figure 7.16: Credible values for the difference in means of the gas mileage between automatic and manual cars. The dashed line is at a difference of zero
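That near-certainty can be quantified numerically. Here is a minimal sketch; the rnorm() draws are simulated stand-ins for the real difference.in.means vector computed above.

```r
# Sketch: summarizing the posterior distribution of the difference in means.
# The rnorm() draws are a simulated stand-in for difference.in.means.
set.seed(1)
difference.in.means <- rnorm(30000, mean = 7.2, sd = 1.8)

# central 95% credible interval for the difference in means
print(quantile(difference.in.means, probs = c(0.025, 0.975)))

# posterior probability that the difference in means is greater than zero
print(mean(difference.in.means > 0))
```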
Notice that the decision to mimic the independent samples t-test made us focus on one particular part of the Bayesian analysis and didn't allow us to appreciate some of the other very valuable information the analysis yielded. For example, in addition to having a distribution illustrating credible differences in means, we have the posterior distribution for the credible values of both the means and standard deviations of both samples. The ability to make a decision on whether the samples' means are significantly different is nice; the ability to look at the posterior distribution of the parameters is better.
Exercises

Practise the following exercises to reinforce the concepts learned in this chapter:

Write a function that will take a vector holding MCMC samples for a parameter and plot a density curve depicting the posterior distribution and the 95% credible interval. Be careful of different scales on the y-axis.

Fitting a normal curve to an empirical distribution is conceptually easy, but not very robust. For distribution fitting that is more robust to outliers, it's common to use a t-distribution instead of the normal distribution, since the t has heavier tails. View the distribution of the shape attribute of the built-in rock data set. Does this look normally distributed? Find the parameters of a normal curve that is a fit to the data. In JAGS, dt, the t-distribution density function, takes three parameters: the mean, the precision, and the degrees of freedom that controls the heaviness of the tails. Find the parameters after fitting a t-distribution to the data. Are the means similar? Which estimate of the mean do you think is more representative of central tendency?

In Theseus' paradox, a wooden ship belonging to Theseus has decaying boards, which are removed and replaced with new lumber. Eventually, all the boards in the original ship have been replaced, so that the ship is made up of completely new matter. Is it still Theseus' ship? If not, at what point did it become a different ship? What would Aristotle say about this? Appeal to the doctrine of the Four Causes. Would Aristotle's stance still hold up if, as in Thomas Hobbes' version of the paradox, the original decaying boards were saved and used to make a complete replica of Theseus' original ship?
Summary

Although most introductory data analysis texts don't even broach the topic of Bayesian methods, you, dear reader, are versed enough in this matter to start applying these techniques to real problems.

We discovered that Bayesian methods could, at least for the models in this chapter, not only allow us to answer the same kinds of questions we might use the binomial test, one sample t-test, and independent samples t-test for, but provide a much richer and more intuitive depiction of our uncertainty in our estimates.

If these approaches interest you, I urge you to learn more about how to extend these to supersede other NHST tests. I also urge you to learn more about the mathematics behind MCMC.

As with the last chapter, we covered much ground here. If you made it through, congratulations!

This concludes the unit on confirmatory data analysis and inferential statistics. In the next unit, we will be concerned less with estimating parameters, and more interested in prediction. Last one there is a rotten egg!
Chapter 8. Predicting Continuous Variables

Now that we've fully covered introductory inferential statistics, we're going to shift our attention to one of the most exciting and practically useful topics in data analysis: predictive analytics. Throughout this chapter, we are going to introduce concepts and terminology from a closely related field called statistical learning or, as it's (somehow) more commonly referred to, machine learning.

Whereas in the last unit, we were using data to make inferences about the world, this unit is primarily about using data to make inferences (or predictions) about other data. On the surface, this might not sound more appealing, but consider the fruits of this area of study: if you've ever received a call from your credit card company asking to confirm a suspicious purchase that you, in fact, did not make, it's because sophisticated algorithms learned your purchasing behavior and were able to detect deviation from that pattern.

Since this is the first chapter leaving inferential statistics and delving into predictive analytics, it's only natural that we would start with a technique that is used for both ends: linear regression.
At the surface level, linear regression is a method that is used both to predict the values that continuous variables take on, and to make inferences about how certain variables are related to a continuous variable. These two procedures, prediction and inference, foundationally rely on the information from statistical models. Statistical models are idealized representations of a theory meant to illustrate and explain a process that generates data. A model is usually an equation, or series of equations, with some number of parameters.

Throughout this chapter, remember the quote generally attributed to George Box:

All models are wrong but some are useful.

A model airplane or car might not be the real thing, but it can help us learn and understand some pretty powerful properties of the object that is being modeled.

Although linear regression is, at a high level, conceptually quite simple, it is absolutely indispensable to modern applied statistics, and a thorough understanding of linear models will pay enormous dividends throughout your career as an analyst.
Linear models

A small baking outfit in upstate New York called No Scone Unturned keeps careful records of the baked goods it produces. The left panel of Figure 8.1 is a scatterplot of diameters and circumferences (in centimeters) of No Scone Unturned's cookies, and depicts their relationship:

Figure 8.1: (left) A scatterplot of diameters and circumferences of No Scone Unturned's cookies; (right) the same plot with a best fit regression line plotted over the data points

A straight line is the perfect thing to represent this data. After fitting a straight line to the data, we can make predictions about the circumferences of cookies that we haven't observed, like ones with diameters of 11 or 0.7 centimeters (if you weren't playing truant in grade school, you'd know there's a consistent and predictable relationship between the diameter of a circle and the circle's circumference, namely π, but we'll ignore that for now).
You may have learned that the equation that describes a line in a Cartesian plane is:

    y = mx + b

where b is the y-intercept (the place where the line intersects with the vertical line at x = 0), and m is the slope (describing the direction and steepness of the line). In linear regression, the equation describing y as a function of x is written as:

    y = β0 + β1x

where β0 (sometimes b0) is the y-intercept, and β1 (sometimes b1) is the slope. Collectively, the βs are known as the beta coefficients.

The equation of the line that best describes this data is:

    y = 0 + πx

making β0 and β1 0 and π respectively.

Knowing this, it is easy to predict the circumferences of cookies that we haven't measured yet. The circumference of the cookie with a diameter of 11 centimeters is 0 + 3.1415(11), or 34.558, and a cookie of 0.7 centimeters is 0 + 3.1415(0.7), or 2.2.
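The arithmetic above can be checked directly in R (a trivial sketch; the function name is ours, not from the book's code):

```r
# the best-fit line for the cookies: y = 0 + pi * x
circumference <- function(diameter) 0 + pi * diameter

print(circumference(11))    # about 34.558 centimeters
print(circumference(0.7))   # about 2.2 centimeters
```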
In predictive analytics' parlance, the variable that we are trying to predict is called the dependent (or, sometimes, target) variable, because its values are dependent on other variables. The variables that we use to predict the dependent variable are called independent (or, sometimes, predictor) variables.

Before moving on to a less silly example, it is important to understand the proper interpretation of the slope β1: it describes how much the dependent variable increases (or decreases) for each unit increase of the independent variable. In this case, for every centimeter increase in a cookie's diameter, the circumference increases π centimeters. In contrast, a negative β1 indicates that as the independent variable increases, the dependent variable decreases.
Simple linear regression

On to a substantially less trivial example, let's say No Scone Unturned has been keeping careful records of how many raisins (in grams) they have been using for their famous oatmeal raisin cookies. They want to construct a linear model describing the relationship between the area of a cookie (in centimeters squared) and how many raisins they use, on average.

In particular, they want to use linear regression to predict how many grams of raisins they will need for a 1-meter long oatmeal raisin cookie. Predicting a continuous variable (grams of raisins) from other variables sounds like a job for regression! In particular, when we use just a single predictor variable (the area of the cookies), the technique is called simple linear regression.
The left panel of Figure 8.2 illustrates the relationship between the area of cookies and the amount of raisins used. It also shows the best-fit regression line:

Figure 8.2: (left) A scatterplot of areas and grams of raisins in No Scone Unturned's cookies with a best-fit regression line; (right) the same plot with highlighted residuals

Note that, in contrast to the last example, virtually none of the data points actually rest on the best-fit line; there are now errors. This is because there is a random component to how many raisins are used.

The right panel of Figure 8.2 draws dashed red lines between each data point and what the best-fit line would predict is the amount of raisins necessary. These dashed lines represent the error in the prediction, and these errors are called residuals.

So far, we haven't discussed how the best-fit line is determined. In essence, the line of best fit will minimize the amount of dashed line. More specifically, the residuals are squared and all added up; this is called the Residual Sum of Squares (RSS). The line that is the best fit will minimize the RSS. This method is called ordinary least squares, or OLS.

Look at the two plots in Figure 8.3. Notice how the regression lines are drawn in ways that clearly do not minimize the amount of red line. The RSS can be further minimized by increasing the slope in the first plot, and decreasing it in the second plot:

Figure 8.3: Two regression lines that do not minimize the RSS
Now that there are differences between the observed values and the predicted values, as there will be in every real-life linear regression you perform, the equation that describes y, the dependent variable, changes slightly:

    y = β0 + β1x + e

The equation without the residual term only describes our prediction, ŷ, pronounced y-hat (because it looks like y is wearing a little hat):

    ŷ = β0 + β1x

Our error term is, therefore, the difference between the value that our model predicts and the actual empirical value for each observation i:

    e_i = y_i - ŷ_i

Formally, the RSS is:

    RSS = Σ e_i² = Σ (y_i - ŷ_i)²

Recall that this is the term that gets minimized when finding the best-fit line.

If the RSS is the sum of the squared residuals (or error terms), the mean of the squared residuals is known as the Mean Squared Error (MSE), and is a very important measure of the accuracy of a model.

Formally, the MSE is:

    MSE = (1/n) Σ (y_i - ŷ_i)²

Occasionally, you will encounter the Root Mean Squared Error (RMSE) as a measure of model fit. This is just the square root of the MSE, putting it in the same units as the dependent variable (instead of units of the dependent variable squared). The difference between the MSE and RMSE is like the difference between variance and standard deviation, respectively. In fact, in both these cases (the MSE/RMSE and variance/standard-deviation), the error terms have to be squared for the very same reason; if they were not, the positive and negative residuals would cancel each other out.
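To make these definitions concrete, here is a short sketch computing the RSS, MSE, and RMSE by hand for the mtcars regression that the next section fits (the variable names are ours):

```r
# fit the model the next section uses: predicting mpg from weight
model <- lm(mpg ~ wt, data = mtcars)

errors <- mtcars$mpg - predict(model)  # the residuals, y_i minus y-hat_i
RSS <- sum(errors^2)                   # Residual Sum of Squares
MSE <- mean(errors^2)                  # Mean Squared Error
RMSE <- sqrt(MSE)                      # back in units of miles per gallon

print(c(RSS = RSS, MSE = MSE, RMSE = RMSE))
```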
Now that we have a bit of the requisite math, we're ready to perform a simple linear regression ourselves, and interpret the output. We will be using the venerable mtcars data set, and try to predict a car's gas mileage (mpg) with the car's weight (wt). We will also be using R's base graphics system (not ggplot2) in this section, because the visualization of linear models is arguably simpler in base R.

First, let's plot the cars' gas mileage as a function of their weights:

> plot(mpg ~ wt, data=mtcars)

Here we employ the formula syntax that we were first introduced to in Chapter 3, Describing Relationships and that we used extensively in Chapter 6, Testing Hypotheses. We will be using it heavily in this chapter as well. As a refresher, mpg ~ wt roughly reads mpg as a function of wt.

Next, let's run a simple linear regression with the lm function, and save it to a variable called model:

> model <- lm(mpg ~ wt, data=mtcars)

Now that we have the model saved, we can, very simply, add a plot of the linear model to the scatterplot we have already created:

> abline(model)
Figure 8.4: The result of plotting output from lm

Finally, let's view the result of fitting the linear model using the summary function, and interpret the output:

> summary(model)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-4.5432 -2.3647 -0.1252  1.4096  6.8727

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
The first block of text reminds us how the model was built syntax-wise (which can actually be useful in situations where the lm call is performed dynamically).

Next, we see a five-number summary of the residuals. Remember that this is in units of the dependent variable. In other words, the data point with the highest residual is 6.87 miles per gallon.

In the next block, labeled Coefficients, direct your attention to the two values in the Estimate column; these are the beta coefficients that minimize the RSS. Specifically, β0 = 37.285 and β1 = -5.345. The equation that describes the best-fit linear model, then, is:

    mpg = 37.285 - 5.345(wt)

Remember, the way to interpret the β1 coefficient is: for every unit increase of the independent variable (it's in units of 1,000 pounds), the dependent variable goes down (because it's negative) 5.345 units (which are miles per gallon). The β0 coefficient indicates, rather nonsensically, that a car that weighs nothing would have a gas mileage of 37.285 miles per gallon. Recall that all models are wrong, but some are useful.

If we wanted to predict the gas mileage of a car that weighed 6,000 pounds, our equation would yield an estimate of about 5.2 miles per gallon. Instead of doing the math by hand, we can use the predict function as long as we supply it with a data frame that holds the relevant information for new observations that we want to predict:

> predict(model, newdata=data.frame(wt=6))
       1
5.218297

Interestingly, we would predict a car that weighs 7,000 pounds would get -0.126 miles per gallon. Again, all models are wrong, but some are useful. For most reasonable car weights, our very simple model yields reasonable predictions.
If we were only interested in prediction, and only interested in this particular model, we would stop here. But, as I mentioned in this chapter's preface, linear regression is also a tool for inference, and a pretty powerful one at that. In fact, we will soon see that many of the statistical tests we were introduced to in Chapter 6, Testing Hypotheses can be equivalently expressed and performed as a linear model.

When viewing linear regression as a tool of inference, it's important to remember that our coefficients are actually just estimates. The cars observed in mtcars represent just a small sample of all extant cars. If somehow we observed all cars and built a linear model, the beta coefficients would be population coefficients. The coefficients that we asked R to calculate are best guesses based on our sample, and, just like our other estimates in previous chapters, they can undershoot or overshoot the population coefficients, and their accuracy is a function of factors such as the sample size, the representativeness of our sample, and the inherent volatility or noisiness of the system we are trying to model.

As estimates, we can quantify our uncertainty in our beta coefficients using standard error, as introduced in Chapter 5, Using Data to Reason About the World. The column of values directly to the right of the Estimate column, labeled Std. Error, gives us these measures. The estimates of the beta coefficients also have a sampling distribution and, therefore, confidence intervals could be constructed for them.

Finally, because the beta coefficients have well-defined sampling distributions (as long as certain simplifying assumptions hold true), we can perform hypothesis tests on them. The most common hypothesis test performed on beta coefficients asks whether they are significantly discrepant from zero. Semantically, if a beta coefficient is significantly discrepant from zero, it is an indication that the independent variable has a significant impact on the prediction of the dependent variable. Remember the long-running warning in Chapter 6, Testing Hypotheses though: just because something is significant doesn't mean it is important.

The hypothesis tests comparing the coefficients to zero yield p-values; those p-values are depicted in the final column of the Coefficients section, labeled Pr(>|t|). We usually don't care about the significance of the intercept coefficient (β0), so we can ignore that. Rather importantly, the p-value for the coefficient belonging to the wt variable is near zero, indicating that the weight of a car has some predictive power on the gas mileage of that car.
Getting back to the summary output, direct your attention to the entry called Multiple R-squared. R-squared, also called the coefficient of determination, is, like MSE, a measure of how good of a fit the model is. In contrast to the MSE though, which is in units of the dependent variable, R² is always between 0 and 1, and thus, can be interpreted more easily. For example, if we changed the units of the dependent variable from miles per gallon to miles per liter, the MSE would change, but the R² would not.

An R² of 1 indicates a perfect fit with no residual error, and an R² of 0 indicates the worst possible fit: the independent variable doesn't help predict the dependent variable at all.

Figure 8.5: Linear models (from left to right) with R²s of 0.75, 0.33, and 0.92

Helpfully, the R² is directly interpretable as the amount of variance in the dependent variable that is explained by the independent variable. In this case, for example, the weight of a car explains about 75.3% of the variance of the gas mileage. Whether 75% constitutes a good R² depends heavily on the domain, but in my field (the behavioral sciences), an R² of 75% is really good.
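As a sketch of what "proportion of variance explained" means, R² can be recomputed by hand from the residuals:

```r
model <- lm(mpg ~ wt, data = mtcars)

TSS <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total variability in mpg
RSS <- sum(resid(model)^2)                     # variability left unexplained
r.squared <- 1 - RSS / TSS

print(r.squared)   # matches the Multiple R-squared from summary(model)
```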
We will have to come back to the rest of the information in the summary output in the section about multiple regression.

Note

Take note of the fact that the p-value of the F-statistic in the last line of the output is the same as the p-value of the t-statistic of the only non-intercept coefficient.
Simple linear regression with a binary predictor

One of the coolest things about linear regression is that we are not limited to using predictor variables that are continuous. For example, in the last section, we used the continuous variable wt (weight) to predict miles per gallon. But linear models are adaptable to using categorical variables, like am (automatic or manual transmission), as well.

Normally, in the simple linear regression equation y = β0 + β1x, x will hold the actual value of the predictor variable. In the case of a simple linear regression with a binary predictor (like am), x will hold a dummy variable instead. Specifically, when the predictor is automatic, x will be 0, and when the predictor is manual, x will be 1.

More formally:

    x = 0 if the transmission is automatic
    x = 1 if the transmission is manual

Put in this manner, the interpretation of the coefficients changes slightly. Since the x will be zero when the car is automatic, β0 is the mean miles per gallon for automatic cars. Similarly, since y will equal β0 + β1 when the car is manual, β1 is equal to the mean difference in the gas mileage between automatic and manual cars.

Concretely:
> model <- lm(mpg ~ am, data=mtcars)
> summary(model)

Call:
lm(formula = mpg ~ am, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-9.3923 -3.0923 -0.2974  3.2439  9.5077

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   17.147      1.125  15.247 1.13e-15 ***
am             7.245      1.764   4.106 0.000285 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.902 on 30 degrees of freedom
Multiple R-squared:  0.3598,    Adjusted R-squared:  0.3385
F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

>
>
> mean(mtcars$mpg[mtcars$am==0])
[1] 17.14737
> (mean(mtcars$mpg[mtcars$am==1]) -
+  mean(mtcars$mpg[mtcars$am==0]))
[1] 7.244939

The intercept term, β0, is 17.15, which is the mean gas mileage of the automatic cars, and β1 is 7.24, which is the difference of the means between the two groups.
The interpretation of the t-statistic and p-value are very special now; a hypothesis test checking to see if β1 (the difference in group means) is significantly different from zero is tantamount to a hypothesis test testing equality of means (the Student's t-test)! Indeed, the t-statistic and p-values are the same:

> # use var.equal to choose Student's t-test
> # over Welch's t-test
> t.test(mpg ~ am, data=mtcars, var.equal=TRUE)

        Two Sample t-test

data:  mpg by am
t = -4.1061, df = 30, p-value = 0.000285
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.84837  -3.64151
sample estimates:
mean in group 0 mean in group 1
       17.14737        24.39231
Isn't that neat!? A two-sample test of equality of means can be equivalently expressed as a linear model! This basic idea can be extended to handle non-binary categorical variables too; we'll see this in the section on multiple regression.

Note that in mtcars, the am column was already coded as 1s (manuals) and 0s (automatics). If automatic cars were dummy coded as 1 and manuals were dummy coded as 0, the results would semantically be the same; the only difference is that β0 would be the mean of manual cars, and β1 would be the (negative) difference in means. The t- and p-values would be the same.

If you are working with a data set that doesn't already have the binary predictor dummy coded, R's lm can handle this too, so long as you wrap the column in a call to factor. For example:
> mtcars$automatic <- ifelse(mtcars$am==0, "yes", "no")
> model <- lm(mpg ~ factor(automatic), data=mtcars)
> model

Call:
lm(formula = mpg ~ factor(automatic), data = mtcars)

Coefficients:
         (Intercept)  factor(automatic)yes
              24.392                -7.245

Finally, note that a car being automatic or manual explains some of the variance in gas mileage, but far less than weight did: this model's R² is only 0.36.
A word of warning

Before we move on, a word of warning: the first part of every regression analysis should be to plot the relevant data. To convince you of this, consider Anscombe's quartet, depicted in Figure 8.6.

Figure 8.6: Four data sets with identical means, standard deviations, regression coefficients, and R²s

Anscombe's quartet holds four x-y pairs that have the same mean, standard deviation, correlation coefficients, linear regression coefficients, and R². In spite of these similarities, all four of these data pairs are very different. It is a warning to not blindly apply statistics on data that you haven't visualized. It is also a warning to take linear regression diagnostics (which we will go over before the chapter's end) seriously.

Only two of the x-y pairs in Anscombe's quartet can be modeled with simple linear regression: the ones in the left column. Of particular interest is the one on the bottom left; it looks like it contains an outlier. After thorough investigation into why that datum made it into our data set, if we decide we really should discard it, we can either (a) remove the offending row, or (b) use robust linear regression.

For a more or less drop-in replacement for lm that uses a robust version of OLS called Iteratively Re-weighted Least Squares (IWLS), you can use the rlm function from the MASS package:

> library(MASS)
> data(anscombe)
> plot(y3 ~ x3, data=anscombe)
> abline(lm(y3 ~ x3, data=anscombe),
+        col="blue", lty=2, lwd=2)
> abline(rlm(y3 ~ x3, data=anscombe),
+        col="red", lty=1, lwd=2)

Figure 8.7: The difference between linear regression fit with OLS and a robust linear regression fitted with IWLS
Note

OK, one more warning

Some suggest that you should almost always use rlm in favor of lm. It's true that rlm is the bee's knees, but there is a subtle danger in doing this, as illustrated by the following statistical urban legend.

Sometime in 1984, NASA was studying the ozone concentrations from various locations. NASA used robust statistical methods that automatically discarded anomalous data points, believing most of them to be instrument errors or errors in transmission. As a result of this, some extremely low ozone readings in the atmosphere above Antarctica were removed from NASA's atmospheric models. The very next year, British scientists published a paper describing a very deteriorated ozone layer in the Antarctic. Had NASA paid closer attention to outliers, they would have been the first to discover it.

It turns out that the relevant part of this story is a myth, but the fact that it is so widely believed is a testament to how possible it is.

The point is, outliers should always be investigated and not simply ignored, because they may be indicative of poor model choice, faulty instrumentation, or a gigantic hole in the ozone layer. Once the outliers are accounted for, then use robust methods to your heart's content.
Multiple regression

More often than not, we want to include not just one, but multiple predictors (independent variables), in our predictive models. Luckily, linear regression can easily accommodate us! The technique? Multiple regression.

By giving each predictor its very own beta coefficient in a linear model, the target variable gets informed by a weighted sum of its predictors. For example, a multiple regression using two predictor variables looks like this:

    y = β0 + β1x1 + β2x2

Now, instead of estimating two coefficients (β0 and β1), we are estimating three: the intercept, the slope of the first predictor, and the slope of the second predictor.

Before explaining further, let's perform a multiple regression predicting gas mileage from weight and horsepower:

> model <- lm(mpg ~ wt + hp, data=mtcars)
> summary(model)

Call:
lm(formula = mpg ~ wt + hp, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max
-3.941 -1.600 -0.182  1.050  5.854

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
hp          -0.03177    0.00903  -3.519  0.00145 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared:  0.8268,    Adjusted R-squared:  0.8148
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12
Since we are now dealing with three variables, the predictive model can no longer be visualized with a line; it must be visualized as a plane in 3D space, as seen in Figure 8.8:

Figure 8.8: The prediction region that is formed by a two-predictor linear model is a plane

Aided by the visualization, we can see that our predictions of mpg are informed by both wt and hp. Both of them contribute negatively to the gas mileage. You can see this from the fact that the coefficients are both negative. Visually, we can verify this by noting that the plane slopes downward as wt increases and as hp increases, although the slope for the latter predictor is less dramatic.

Although we lose the ability to easily visualize it, the prediction region formed by a more-than-two-predictor linear model is called a hyperplane, and exists in n-dimensional space, where n is the number of predictor variables plus 1.

The astute reader may have noticed that the beta coefficient belonging to the wt variable is not the same as it was in the simple linear regression. The beta coefficient for hp, too, is different than the one estimated using simple regression:

> coef(lm(mpg ~ wt + hp, data=mtcars))
(Intercept)          wt          hp
37.22727012 -3.87783074 -0.03177295
> coef(lm(mpg ~ wt, data=mtcars))
(Intercept)          wt
  37.285126   -5.344472
> coef(lm(mpg ~ hp, data=mtcars))
(Intercept)          hp
30.09886054 -0.06822828
The explanation has to do with a subtle difference in how the coefficients should be interpreted now that there is more than one independent variable. The proper interpretation of the coefficient belonging to wt is not that as the weight of the car increases by 1 unit (1,000 pounds), the miles per gallon, on average, decreases by 3.878 miles per gallon. Instead, the proper interpretation is: holding horsepower constant, as the weight of the car increases by 1 unit (1,000 pounds), the miles per gallon, on average, decreases by 3.878 miles per gallon.

Similarly, the correct interpretation of the coefficient belonging to hp is: holding the weight of the car constant, as the horsepower of the car increases by 1, the miles per gallon, on average, decreases by 0.032 miles per gallon. Still confused?

It turns out that cars with more horsepower use more gas. It is also true that cars with higher horsepower tend to be heavier. When we put these predictors (weight and horsepower) into a linear model together, the model attempts to tease apart the independent contributions of each of the variables by removing the effects of the other. In multivariate analysis, this is known as controlling for a variable. Hence, the preface to the interpretation can be, equivalently, stated as controlling for the effects of the weight of a car, as the horsepower…. Because cars with higher horsepower tend to be heavier, when you remove the effect of horsepower, the influence of weight goes down, and vice versa. This is why the coefficients for these predictors are both smaller than they are in simple single-predictor regression.
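The overlap described above is easy to confirm: weight and horsepower are themselves substantially correlated (a one-line check of our own, not from the book's code):

```r
# cars with more horsepower do indeed tend to be heavier
hp.wt.cor <- cor(mtcars$wt, mtcars$hp)
print(hp.wt.cor)
```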
In controlled experiments, scientists introduce an experimental condition on two samples that are virtually the same, except for the independent variable being manipulated (for example, giving one group a placebo and one group real medication). If they are careful, they can attribute any observed effect directly to the manipulated independent variable. In simple cases like this, statistical control is often unnecessary. But statistical control is of utmost importance in the other areas of science (especially the behavioral and social sciences) and business, where we are privy only to data from non-controlled natural phenomena.

For example, suppose someone made the claim that gum chewing causes heart disease. To back up this claim, they appealed to data showing that the more someone chews gum, the higher the probability of developing heart disease. The astute skeptic could claim that it's not the gum chewing per se that is causing the heart disease, but the fact that smokers tend to chew gum more often than non-smokers to mask the gross smell of tobacco smoke. If the person who made the original claim went back to the data, and included the number of cigarettes smoked per day as a component of a regression analysis, there would be a coefficient representing the independent influence of gum chewing, and, ostensibly, the statistical test of that coefficient's difference from zero would fail to reject the null hypothesis.

In this situation, the number of cigarettes smoked per day is called a confounding variable. The purpose of a carefully designed scientific experiment is to eliminate confounds, but as mentioned earlier, this is often not a luxury available in certain circumstances and domains.

For example, we are so sure that cigarette smoking causes heart disease that it would be unethical to design a controlled experiment in which we take two random samples of people, and ask one group to smoke and one group to just pretend to smoke. Sadly, cigarette companies know this, and they can plausibly claim that it isn't cigarette smoking that causes heart disease, but rather that the kind of people who eventually become cigarette smokers also engage in behaviors that increase the risk of heart disease, like eating red meat and not exercising, and that it's those variables that are making it appear as if smoking is associated with heart disease. Since we can't control for every potential confound that the cigarette companies can dream up, we may never be able to thwart this claim.
Anyhow, back to our two-predictor example: examine the R² value, and how it is different now that we've included horsepower as an additional predictor. Our model now explains more of the variance in gas mileage. As a result, our predictions will, on average, be more accurate.

Let's predict what the gas mileage of a 2,500 pound car with a horsepower of 275 (horses?) might be:

> predict(model, newdata=data.frame(wt=2.5, hp=275))
       1
18.79513

Finally, we can explain the last line of the linear model summary: the one with the F-statistic and associated p-value. The F-statistic measures the ability of the entire model, as a whole, to explain any variance in the dependent variable. Since it has a sampling distribution (the F-distribution) and associated degrees of freedom, it yields a p-value, which can be interpreted as the probability that a model would explain this much (or more) of the variance of the dependent variable if the predictors had no predictive power. The fact that our model has a p-value lower than 0.05 suggests that our model predicts the dependent variable better than chance.

Now we can see why the p-value for the F-statistic in the simple linear regression was the same as the p-value of the t-statistic for the only non-intercept predictor: the tests were equivalent because there was only one source of predictive capability.
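That equivalence can be verified directly from the summary object; in the one-predictor case, the F-statistic is exactly the square of the coefficient's t-statistic (a quick sketch of our own):

```r
model <- lm(mpg ~ wt, data = mtcars)
s <- summary(model)

t.value <- s$coefficients["wt", "t value"]
f.value <- unname(s$fstatistic["value"])

print(c(t.squared = t.value^2, F = f.value))   # the two are equal
```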
We can also see now why the p-value associated with our F-statistic in the multiple regression analysis output earlier is far lower than the p-values of the t-statistics of the individual predictors: the latter only captures the predictive power of each (one) predictor, while the former captures the predictive power of the model as a whole (all two).
Regression with a non-binary predictor
Back in a previous section, I promised that the same dummy-coding method that we used to regress binary categorical variables could be adapted to handle categorical variables with more than two values. For an example of this, we are going to use the same WeightLoss dataset as we did to illustrate ANOVA.
To review, the WeightLoss dataset contains pounds lost and self-esteem measurements for three weeks for three different groups: a control group, one group just on a diet, and one group that dieted and exercised. We will be trying to predict the amount of weight lost in week 2 by the group the participant was in.
Instead of just having one dummy-coded predictor, we now need two. Specifically, one dummy variable takes the value 1 for participants in the diet-only group (and 0 otherwise), and the other takes the value 1 for participants in the diet-and-exercise group (and 0 otherwise); the control group is coded as 0 on both. Consequently, the equation describing our predictive model is:
wl2 = β₀ + β₁ × Diet + β₂ × DietEx
Meaning that β₀ is the mean of weight lost in the control group, β₁ is the difference in the weight lost between the control and diet-only group, and β₂ is the difference in the weight lost between the control and the diet-and-exercise group.
> # the dataset is in the car package
> library(car)
> model <- lm(wl2 ~ factor(group), data=WeightLoss)
> summary(model)

Call:
lm(formula = wl2 ~ factor(group), data = WeightLoss)

Residuals:
    Min      1Q  Median      3Q     Max
 -2.100  -1.054  -0.100   0.900   2.900

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)           3.3333     0.3756   8.874 5.12e-10 ***
factor(group)Diet     0.5833     0.5312   1.098    0.281
factor(group)DietEx   2.7667     0.5571   4.966 2.37e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.301 on 31 degrees of freedom
Multiple R-squared:  0.4632,  Adjusted R-squared:  0.4285
F-statistic: 13.37 on 2 and 31 DF,  p-value: 6.494e-05
As before, the p-values associated with the t-statistics are directly interpretable as a t-test of equality of means with the weight lost by the control. Observe that the p-value associated with the t-statistic of the factor(group)Diet coefficient is not significant. This comports with the results from the pairwise t-test from Chapter 6, Testing Hypotheses.
Most magnificently, compare the F-statistic and the associated p-value in the preceding code with the one in the aov ANOVA from Chapter 6, Testing Hypotheses. They are the same! The F-test of a linear model with a non-binary categorical variable predictor is the same as an NHST analysis of variance!
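You can check this equivalence directly in the console; a sketch along these lines pulls the F-statistic out of both procedures and compares them:

```r
library(car)    # for the WeightLoss dataset

# F-statistic from the linear model...
lm.f <- unname(summary(lm(wl2 ~ factor(group),
                          data = WeightLoss))$fstatistic["value"])
# ...and from the analysis of variance
aov.f <- summary(aov(wl2 ~ group, data = WeightLoss))[[1]][1, "F value"]

c(lm.f, aov.f)    # both are 13.37...
```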
Kitchen sink regression
When the goal of using regression is simply predictive modeling, we often don’t care about which particular predictors go into our model, so long as the final model yields the best possible predictions.
A naïve (and awful) approach is to use all the independent variables available to try to model the dependent variable. Let’s try this approach by trying to predict mpg from every other variable in the mtcars dataset:
> # the period after the squiggly denotes all other variables
> model <- lm(mpg ~ ., data=mtcars)
> summary(model)

Call:
lm(formula = mpg ~ ., data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4506 -1.6044 -0.1196  1.2193  4.6271

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.30337   18.71788   0.657   0.5181
cyl         -0.11144    1.04502  -0.107   0.9161
disp         0.01334    0.01786   0.747   0.4635
hp          -0.02148    0.02177  -0.987   0.3350
drat         0.78711    1.63537   0.481   0.6353
wt          -3.71530    1.89441  -1.961   0.0633 .
qsec         0.82104    0.73084   1.123   0.2739
vs           0.31776    2.10451   0.151   0.8814
am           2.52023    2.05665   1.225   0.2340
gear         0.65541    1.49326   0.439   0.6652
carb        -0.19942    0.82875  -0.241   0.8122
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066
F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
Hey, check out our R-squared value! It looks like our model explains 87% of the variance in the dependent variable. This is really good—it’s certainly better than our simple regression models that used weight (wt) and transmission (am), with the respective R-squared values of 0.753 and 0.36.
Maybe there’s something to just including everything we have in our linear models. In fact, if our only goal is to maximize our R-squared, you can always achieve this by throwing every variable you have into the mix, since the introduction of each marginal variable can only increase the amount of variance explained. Even if a newly introduced variable has absolutely no predictive power, the worst it can do is not help explain any variance in the dependent variable—it can never make the model explain less variance.
This approach to regression analysis is often (non-affectionately) called kitchen-sink regression, and is akin to throwing all of your variables against a wall to see what sticks. If you have a hunch that this approach to predictive modeling is crummy, your instinct is correct on this one.
To develop your intuition about why this approach backfires, consider building a linear model to predict a variable of only 32 observations using 200 explanatory variables, which are uniformly and randomly distributed. Just by random chance, there will very likely be some variables that correlate strongly with the dependent variable. A linear regression that includes some of these lucky variables will yield a model that is surprisingly (sometimes astoundingly) predictive.
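A small simulation makes this concrete. The following sketch uses 25 random predictors rather than 200 (so that lm still has residual degrees of freedom with only 32 rows), but the lesson is the same: pure noise can "explain" most of the variance:

```r
set.seed(2)
n.obs  <- 32
n.pred <- 25   # fewer than 200, so lm can still be fit to 32 rows

# a dependent variable and 25 predictors, all pure random noise
random.data <- as.data.frame(matrix(runif(n.obs * (n.pred + 1)),
                                    nrow = n.obs))
names(random.data)[1] <- "y"

noise.model <- lm(y ~ ., data = random.data)
summary(noise.model)$r.squared   # a very high R-squared, from nothing at all
```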
Remember that when we are creating predictive models, we rarely (if ever) care about how well we can predict the data we already have. The whole point of predictive analytics is to be able to predict the behavior of data we don’t have. For example, memorizing the answer key to last year’s Social Studies final won’t help you on this year’s final, if the questions are changed—it’ll only prove you can get an A+ on last year’s test.
Imagine generating a new random dataset of 200 explanatory variables and one dependent variable, and using the coefficients from the linear model of the first random dataset to predict this new data. How well do you think the model will perform?
The model will, of course, perform very poorly, because the coefficients in the model were informed solely by random noise. The model captured chance patterns in the data that it was built with and not a larger, more general pattern—mostly because there was no larger pattern to model!
In statistical learning parlance, this phenomenon is called overfitting, and it happens often when there are many predictors in a model. It is particularly frequent when the number of observations is less than (or not very much larger than) the number of predictor variables (like in mtcars), because there is a greater probability for the many predictors to have a spurious relationship with the dependent variable.
This general occurrence—a model performing well on the data it was built with but poorly on subsequent data—illustrates perfectly perhaps the most common complication with statistical learning and predictive analytics: the bias-variance tradeoff.
The bias-variance trade-off
Figure 8.9: The two extremes of the bias-variance tradeoff: (left) a (complicated) model with essentially zero bias (on training data) but enormous variance, (right) a simple model with high bias but virtually no variance
In statistical learning, the bias of a model refers to the error of the model introduced by attempting to model a complicated real-life relationship with an approximation. A model with no bias will never make any errors in prediction (like the cookie-area prediction problem). A model with high bias will fail to accurately predict its dependent variable.
The variance of a model refers to how sensitive a model is to changes in the data that built the model. A model with low variance would change very little when built with new data. A linear model with high variance is very sensitive to changes to the data that it was built with, and the estimated coefficients will be unstable.
The term bias-variance tradeoff illustrates that it is easy to decrease bias at the expense of increasing variance, and vice-versa. Good models will try to minimize both.
Figure 8.9 depicts two extremes of the bias-variance tradeoff. The left-most model depicts a complicated and highly convoluted model that passes through all the data points. This model has essentially no bias, as it has no error when predicting the data that it was built with. However, the model is clearly picking up on random noise in the dataset, and if the model were used to predict new data, there would be significant error. If the same general model were rebuilt with new data, the model would change significantly (high variance). As a result, the model is not generalizable to new data. Models like this suffer from overfitting, which often occurs when overly complicated or overly flexible models are fitted to data—especially when sample size is lacking.
In contrast, the model on the right panel of Figure 8.9 is a simple model (the simplest, actually). It is just a horizontal line at the mean of the dependent variable, mpg. This does a pretty terrible job modeling the variance in the dependent variable, and exhibits high bias. This model does have one attractive property though—the model will barely change at all if fit to new data; the horizontal line will just move up or down slightly based on the mean of the mpg column of the new data.
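This simplest model is easy to build yourself; in R, an intercept-only formula fits exactly this horizontal line:

```r
# an intercept-only model: the fitted "line" is just the mean of mpg
null.model <- lm(mpg ~ 1, data = mtcars)
coef(null.model)     # 20.09, which is...
mean(mtcars$mpg)     # ...exactly the mean of the mpg column
```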
To demonstrate that our kitchen sink regression puts us on the wrong side of the optimal point in the bias-variance tradeoff, we will use a model validation and assessment technique called cross-validation.
Cross-validation
Given that the goal of predictive analytics is to build generalizable models that predict well for data yet unobserved, we should ideally be testing our models on data unseen, and check our predictions against the observed outcomes. The problem with that, of course, is that we don’t know the outcomes of data unseen—that’s why we want a predictive model. We do, however, have a trick up our sleeve, called the validation set approach.
The validation set approach is a technique to evaluate a model’s ability to perform well on an independent dataset. But instead of waiting to get our hands on a completely new dataset, we simulate a new dataset with the one we already have.
The main idea is that we can split our dataset into two subsets; one of these subsets (called the training set) is used to fit our model, and then the other (the testing set) is used to test the accuracy of that model. Since the model was built before ever touching the testing set, the testing set serves as an independent data source of prediction accuracy estimates, unbiased by the model’s precision attributable to its modeling of idiosyncratic noise.
To get at our predictive accuracy by performing our own validation set, let’s use the sample function to divide the row indices of mtcars into two equal groups, create the subsets, and train a model on the training set:
> set.seed(1)
> train.indices <- sample(1:nrow(mtcars), nrow(mtcars)/2)
> training <- mtcars[train.indices,]
> testing  <- mtcars[-train.indices,]
> model <- lm(mpg ~ ., data=training)
> summary(model)
..... (output truncated)
Residual standard error: 1.188 on 5 degrees of freedom
Multiple R-squared:  0.988,  Adjusted R-squared:  0.9639
F-statistic: 41.06 on 10 and 5 DF,  p-value: 0.0003599
Before we go on, note that the model now explains a whopping 99% of the variance in mpg. Any R² this high should be a red flag; I’ve never seen a legitimate model with an R² this high on a non-contrived dataset. The increase in R² is attributable primarily to the decrease in observations (from 32 to 16) and the resultant increased opportunity to model spurious correlations.
Let’s calculate the MSE of the model on the training dataset. To do this, we will be using the predict function without the newdata argument, which gives us the predictions the model makes on the training data (these are referred to as the fitted values):
> mean((predict(model) - training$mpg)^2)
[1] 0.4408109
> # Cool, but how does it perform on the validation set?
> mean((predict(model, newdata=testing) - testing$mpg)^2)
[1] 337.9995
My word!
In practice, the error on the training data is almost always a little less than the error on the testing data. However, a discrepancy in the MSE between the training and testing set as large as this is a clear-as-day indication that our model doesn’t generalize.
Let’s compare this model’s validation set performance to a simpler model with a lower R², which only uses am and wt as predictors:
> simpler.model <- lm(mpg ~ am + wt, data=training)
> mean((predict(simpler.model) - training$mpg)^2)
[1] 9.396091
> mean((predict(simpler.model, newdata=testing) - testing$mpg)^2)
[1] 12.70338
Notice that the MSE on the training data is much higher, but our validation set MSE is much lower.
If the goal is to blindly maximize the R², the more predictors, the better. If the goal is a generalizable and useful predictive model, the goal should be to minimize the testing set MSE.
The validation set approach outlined in the previous paragraph has two important drawbacks. For one, the model was only built using half of the available data. Secondly, we only tested the model’s performance on one testing set; at the sleight of a magician’s hand, our testing set could have contained some bizarre hard-to-predict examples that would make the validation set MSE too large.
Consider the following change to the approach: we divide the data up, just as before, into set a and set b. Then, we train the model on set a, test it on set b, then train it on b and test it on a. This approach has a clear advantage over our previous approach, because it averages the out-of-sample MSE of two testing sets. Additionally, the model will now be informed by all the data. This is called two-fold cross validation, and the general technique is called k-fold cross validation.
Note
The coefficients of the model will, of course, be different, but the actual data model (the variables to include and how to fit the line) will be the same.
To see how k-fold cross validation works in a more general sense, consider the procedure to perform k-fold cross validation where k=5. First, we divide the data into five equal groups (sets a, b, c, d, and e), and we train the model on the data from sets a, b, c, and d. Then we record the MSE of the model against the unseen data in set e. We repeat this four more times—leaving out a different set and testing the model with it. Finally, the average of our five out-of-sample MSEs is our five-fold cross validated MSE.
Your goal, now, should be to select a model that minimizes the k-fold cross validation MSE. Common choices of k are 5 and 10.
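Before reaching for a package, it may help to see the procedure spelled out by hand. This sketch performs five-fold cross validation of an am + wt + qsec model on mtcars (the fold-assignment code here is my own illustration, not from the book):

```r
set.seed(1)
k <- 5
# randomly assign each row of mtcars to one of 5 folds
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

fold.mses <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]   # fit on four folds...
  test  <- mtcars[folds == i, ]   # ...and test on the held-out fold
  fit <- lm(mpg ~ am + wt + qsec, data = train)
  mean((predict(fit, newdata = test) - test$mpg)^2)
})

mean(fold.mses)   # the five-fold cross-validated MSE
```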
To perform k-fold cross validation, we will be using the cv.glm function from the boot package. This will also require us to build our models using the glm function (this stands for generalized linear models, which we’ll learn about in the next chapter) instead of lm. For current purposes, it is a drop-in replacement:
> library(boot)
> bad.model <- glm(mpg ~ ., data=mtcars)
> better.model <- glm(mpg ~ am + wt + qsec, data=mtcars)
>
> bad.cv.err <- cv.glm(mtcars, bad.model, K=5)
> # the cross-validated MSE estimate we will be using
> # is a bias-corrected one stored as the second element
> # in the 'delta' vector of the cv.err object
> bad.cv.err$delta[2]
[1] 14.92426
>
> better.cv.err <- cv.glm(mtcars, better.model, K=5)
> better.cv.err$delta[2]
[1] 7.944148
The use of k-fold cross validation over the simple validation set approach has illustrated that the kitchen-sink model is not as bad as we previously thought (because we trained it using more data), but it is still outperformed by the far simpler model that includes only am, wt, and qsec as predictors.
This out-performance by a simpler model is no idiosyncrasy of this dataset; it is a well-observed phenomenon in predictive analytics. Simpler models often outperform overly complicated models because of the resistance of a simpler model to overfitting. Further, simpler models are easier to interpret, to understand, and to use. The idea that, given the same level of predictive power, we should prefer simpler models to complicated ones is expressed in a famous principle called Occam’s Razor.
Finally, we have enough background information to discuss the only piece of the lm summary output we haven’t touched upon yet: adjusted R-squared. Adjusted R² attempts to take into account the fact that extraneous variables thrown into a linear model will always increase its R². Adjusted R², therefore, takes the number of predictors into account. As such, it penalizes complex models. Adjusted R² will always be equal to or lower than non-adjusted R² (it can even go negative!). The addition of each marginal predictor will only cause an increase in adjusted R² if it contributes significantly to the predictive power of the model, that is, more than would be dictated by chance. If it doesn’t, the adjusted R² will decrease. Adjusted R² has some great properties, and as a result, many will try to select models that maximize the adjusted R², but I prefer the minimization of cross-validated MSE as my main model selection criterion.
Compare for yourself the adjusted R² of the kitchen-sink model and a model using am, wt, and qsec.
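If you would like to spoil that exercise for yourself, the comparison takes only a few lines, and it comes out in the simpler model’s favor:

```r
sink.model    <- lm(mpg ~ ., data = mtcars)               # all predictors
simpler.model <- lm(mpg ~ am + wt + qsec, data = mtcars)  # just three

summary(sink.model)$adj.r.squared      # about 0.81
summary(simpler.model)$adj.r.squared   # about 0.83 -- the simpler model wins
```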
Striking a balance
As Figure 8.10 depicts, as a model becomes more complicated/flexible—as it starts to include more and more predictors—the bias of the model continues to decrease. Along the complexity axis, as the model begins to fit the data better and better, the cross-validation error decreases as well. At a certain point, the model becomes overly complex, and begins to fit idiosyncratic noise in the training dataset—it overfits! The cross-validation error begins to climb again, even as the bias of the model approaches its theoretical minimum!
The very left of the plot depicts models with too much bias, but little variance. The right side of the plot depicts models that have very low bias, but very high variance, and thus, are useless predictive models.
Figure 8.10: As model complexity/flexibility increases, training error (bias) tends to be reduced. Up to a certain point, the cross-validation error decreases as well. After that point, the cross-validation error starts to go up again, even as the model’s bias continues to decrease. After this point, the model is too flexible and overfits.
The ideal point in this bias-variance tradeoff is at the point where the cross-validation error (not the training error) is minimized.
Okay, so how do we get there?
Although there are more advanced methods that we’ll touch on in the section called Advanced topics, at this stage of the game, our primary recourse for finding our bias-variance tradeoff sweet spot is careful feature selection.
In statistical learning parlance, feature selection refers to selecting which predictor variables to include in our model (for some reason, they call predictor variables features). I emphasized the word careful, because there are plenty of dangerous ways to do this. One such method—and perhaps the most intuitive—is to simply build models containing every possible subset of the available predictors, and choose the best one as measured by adjusted R² or the minimization of cross-validated error. Probably, the biggest problem with this approach is that it’s computationally very expensive—to build a model for every possible subset of predictors in mtcars, you would need to build (and cross validate) 1,023 different models. The number of possible models rises exponentially with the number of predictors. Because of this, for many real-world modeling scenarios, this method is out of the question.
There is another approach that, for the most part, solves the problem of the computational intractability of the all-possible-subsets approach: step-wise regression.
Stepwise regression is a technique that programmatically tests different predictor combinations by adding predictors in (forward stepwise), or taking predictors out (backward stepwise) according to the value that each predictor adds to the model as measured by its influence on the adjusted R². Therefore, like the all-possible-subsets approach, stepwise regression automates the process of feature selection.
Note
In case you care, the most popular implementation of this technique in R (the stepAIC function in the MASS package) doesn’t maximize adjusted R² but, instead, minimizes a related model quality measure called the Akaike Information Criterion (AIC).
There are numerous problems with this approach. The least of these is that it is not guaranteed to find the best possible model.
One of the primary issues that people cite is that it results in lazy science by absolving us of the need to think out the problem, because we let an automated procedure make decisions for us. This school of thought usually holds that models should be informed, at least partially, by some amount of theory and domain expertise.
It is for these reasons that stepwise regression has fallen out of favor among many statisticians, and why I’m choosing not to recommend using it.
Stepwise regression is like alcohol: some people can use it without incident, but some can’t use it safely. It is also like alcohol in that if you think you need to use it, you’ve got a big problem. Finally, neither can be advertised to children.
At this stage of the game, I suggest that your main approach to balancing bias and variance should be informed theory-driven feature selection, and paying close attention to k-fold cross validation results. In cases where you have absolutely no theory, I suggest using regularization, a technique that is, unfortunately, beyond the scope of this text. The section Advanced topics briefly extols the virtues of regularization, if you want more information.
Linear regression diagnostics
I would be negligent if I failed to mention the boring but very critical topic of the assumptions of linear models, and how to detect violations of those assumptions. Just like the assumptions of the hypothesis tests in Chapter 6, Testing Hypotheses, linear regression has its own set of assumptions, the violation of which jeopardizes the accuracy of our model—and any inferences derived from it—to varying degrees. The checks and tests that ensure these assumptions are met are called diagnostics.
There are five major assumptions of linear regression:
That the errors (residuals) are normally distributed with a mean of 0
That the error terms are uncorrelated
That the errors have a constant variance
That the effect of the independent variables on the dependent variable is linear and additive
That multi-collinearity is at a minimum
We’ll briefly touch on these assumptions, and how to check for them, in this section. To do this, we will be using a residual-fitted plot, since it allows us, with some skill, to verify most of these assumptions. To view a residual-fitted plot, just call the plot function on your linear model object:
> my.model <- lm(mpg ~ wt, data=mtcars)
> plot(my.model)
This will show you a series of four diagnostic plots—the residual-fitted plot is the first. You can also opt to view just the residual-fitted plot with this related incantation:
> plot(my.model, which=1)
We are also going back to Anscombe’s Quartet, since the quartet’s aberrant relationships collectively illustrate the problems that you might find with fitting regression models and assumption violation. To re-familiarize yourself with the quartet, look back to Figure 8.6.
Second Anscombe relationship
The first relationship in Anscombe’s Quartet (y1 ~ x1) is the only one that can appropriately be modeled with linear regression as is. In contrast, the second relationship (y2 ~ x2) depicts a relationship that violates the requirement of a linear relationship. It also subtly violates the assumption of normally distributed residuals with a mean of zero. To see why, refer to Figure 8.11, which depicts its residual-fitted plot:
Figure 8.11: The top two panels show the first and second relationships of Anscombe’s quartet, respectively. The bottom two panels depict each top panel’s respective residual-fitted plot
A non-pathological residual-fitted plot will have data points randomly distributed along the invisible horizontal line, where the y-axis equals 0. By default, this plot also contains a smooth curve that attempts to fit the residuals. In a non-pathological sample, this smooth curve should be approximately straight, and straddle the line at y = 0.
As you can see, the first Anscombe relationship does this well. In contrast, the smooth curve of the second relationship is a parabola. These residuals could have been drawn from a normal distribution with a mean of zero, but it is highly unlikely. Instead, it looks like these residuals were drawn from a distribution—perhaps from a normal distribution—whose mean changed as a function of the x-axis. Specifically, it appears as if the residuals at the two ends were drawn from a distribution whose mean was negative, and the middle residuals had a positive mean.
Third Anscombe relationship
We already dug deeper into this relationship when we spoke of robust regression earlier in the chapter. We saw that a robust fit of this relationship more or less ignored the clear outlier. Indeed, the robust fit is almost identical to the non-robust linear fit after the outlier is removed.
On occasion, a data point that is an outlier in the y-axis but not the x-axis (like this one) doesn’t influence the regression line much—meaning that its omission wouldn’t cause a substantial change in the estimated intercept and coefficients.
A data point that is an outlier in the x-axis (or axes) is said to have high leverage. Sometimes, points with high leverage don’t influence the regression line much, either. However, data points that have high leverage and are outliers very often exert high influence on the regression fit, and must be handled appropriately.
Refer to the upper-right panel of Figure 8.12. The aberrant data point in the fourth relationship of Anscombe’s quartet has very high leverage and high influence. Note that the slope of the regression line is completely determined by the y-position of that point.
Fourth Anscombe relationship
The following image depicts some of the linear regression diagnostic plots of the fourth Anscombe relationship:
Figure 8.12: The first and the fourth Anscombe relationships and their respective residual-fitted plots
Although it’s difficult to say for sure, this is probably in violation of the assumption of constant variance of residuals (also called homogeneity of variance or homoscedasticity if you’re a fancy-pants).
A more illustrative example of the violation of homoscedasticity (or heteroscedasticity) is shown in Figure 8.13:
Figure 8.13: A paradigmatic depiction of the residual-fitted plot of a regression model for which the assumption of homogeneity of variance is violated
The preceding plot depicts the characteristic funnel shape symptomatic of residual-fitted plots of offending regression models. Notice how on the left, the residuals vary very little, but the variances grow as you go along the x-axis.
Bear in mind that the residual-fitted plot need not resemble a funnel—any residual-fitted plot that very clearly shows the variance change as a function of the x-axis violates this assumption.
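If you would like to produce an offending model for yourself, it is easy to manufacture one; this sketch simulates data whose error variance grows with x and then draws its residual-fitted plot:

```r
set.seed(3)
x <- runif(200, min = 1, max = 10)
y <- 2 * x + rnorm(200, sd = x)   # the residual sd grows with x
het.model <- lm(y ~ x)
plot(het.model, which = 1)        # shows the characteristic funnel shape
```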
Looking back on Anscombe’s Quartet, you may think that the three relationships’ unsuitability for linear modeling was obvious, and you may not immediately see the benefit of diagnostic plots. But before you write off the art (not science) of linear regression diagnostics, consider that these were all relationships with a single predictor. In multiple regression, with tens of predictors (or more), it is very difficult to diagnose problems by just plotting different cuts of the data. It is in this domain where linear regression diagnostics really shine.
Finally, the last hazard to be mindful of when linearly regressing is the problem of collinearity or multicollinearity. Collinearity occurs when two (or more) predictors are very highly correlated. This causes multiple problems for regression models, including highly uncertain and unstable coefficient estimates. An extreme example of this would be if we are trying to predict weight from height, and we had both height in feet and height in meters as predictors. In its most simple case, collinearity can be checked for by looking at the correlation matrix of all the regressors (using the cor function); any cell that has a high correlation coefficient implicates two predictors that are highly correlated and, therefore, hold redundant information in the model. In theory, one of these predictors should be removed.
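For instance, a quick look at a few of the mtcars regressors (the choice of columns here is mine) reveals some alarmingly high pairwise correlations:

```r
# several predictors of mpg are themselves strongly correlated
round(cor(mtcars[, c("cyl", "disp", "hp", "wt")]), 2)
# cyl and disp, for example, correlate at about 0.90
```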
A more sneaky issue presents itself when there are no two individual predictors that are highly correlated, but there are multiple predictors that are collectively correlated. This is multicollinearity. This would occur to a small extent, for example, if instead of predicting mpg from other variables in the mtcars dataset, we were trying to predict a (non-existent) new variable using mpg and the other predictors. Since we know that mpg can be fairly reliably estimated from some of the other variables in mtcars, when it is a predictor in a regression modeling another variable, it would be difficult to tell whether the target’s variance is truly explained by mpg, or whether it is explained by mpg’s predictors.
The most common technique to detect multicollinearity is to calculate each predictor variable’s Variance Inflation Factor (VIF). The VIF measures how much larger the variance of a coefficient is because of its collinearity. Mathematically, the VIF of a predictor, i, is:
VIF_i = 1 / (1 - R_i²)
where R_i² is the R² of a linear model predicting predictor i from all the other predictors.
As such, the VIF has a lower bound of one (in the case that the predictor cannot be predicted accurately from the other predictors). Its upper bound is asymptotically infinite.
In general, most view VIFs of more than four as cause for concern, and VIFs of 10 or above indicative of a very high degree of multicollinearity. You can calculate VIFs for a model, post hoc, with the vif function from the car package:
> model <- lm(mpg ~ am + wt + qsec, data=mtcars)
> library(car)
> vif(model)
      am       wt     qsec
2.541437 2.482952 1.364339
Advanced topics
Linear models are the biggest idea in applied statistics and predictive analytics. There are massive volumes written about the smallest details of linear regression. As such, there are some important ideas that we can’t go over here because of space concerns, or because it requires knowledge beyond the scope of this book. So you don’t feel like you’re in the dark, though, here are some of the topics we didn’t cover—and that I would have liked to—and why they are neat.
Regularization: Regularization was mentioned briefly in the subsection about balancing bias and variance. In this context, regularization is a technique wherein we penalize models for complexity, to varying degrees. My favorite method of regularizing linear models is by using elastic-net regression. It is a fantastic technique and, if you are interested in learning more about it, I suggest you install and read the vignette of the glmnet package:
> install.packages("glmnet")
> library(glmnet)
> vignette("glmnet_beta")
Non-linear modeling: Surprisingly, we can model highly non-linear relationships using linear regression. For example, let’s say we wanted to build a model that predicts how many raisins to use for a cookie using the cookie’s radius as a predictor. The relationship between predictor and target is no longer linear—it’s quadratic. However, if we create a new predictor that is the radius squared, the target will now have a linear relationship with the new predictor, and thus, can be captured using linear regression. This basic premise can be extended to capture relationships that are cubic (power of 3), quartic (power of 4), and so on; this is called polynomial regression. Other forms of non-linear modeling don’t use polynomial features, but instead, directly fit non-linear functions to the predictors. Among these forms are regression splines and Generalized Additive Models (GAMs).
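Here is what that looks like with a made-up version of the cookie example (the data below are simulated purely for illustration):

```r
# simulated cookie data: raisin count grows with the square of the radius
set.seed(4)
radius  <- seq(1, 10, by = 0.5)
raisins <- 2 * radius^2 + rnorm(length(radius), sd = 4)

linear.model    <- lm(raisins ~ radius)        # misses the curvature
quadratic.model <- lm(raisins ~ I(radius^2))   # captures it

summary(linear.model)$r.squared
summary(quadratic.model)$r.squared   # the quadratic fit explains more variance
```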
Interaction terms: Just like there are generalizations of linear regression that remove the requirement of linearity, so too are there generalizations of linear regression that eliminate the need for the strictly additive and independent effects between predictors.
Take grapefruit juice, for example. Grapefruit juice is well known to block the intestinal enzyme CYP3A, and drastically affect how the body absorbs certain medicines. Let’s pretend that grapefruit juice was mildly effective at treating existential dysphoria. And suppose there is a drug called Soma that was highly effective at treating this condition. When alleviation of symptoms is plotted as a function of dose, the grapefruit juice will have a very small slope, but the Soma will have a very large slope. Now, if we also pretend that grapefruit juice increases the efficiency of Soma absorption, then the relief of dysphoria of someone taking both grapefruit juice and Soma will be far higher than would be predicted by a multiple regression model that doesn’t take into account the synergistic effects of Soma and the juice. The simplest way to model this interaction effect is to include the interaction term in the lm formula, like so:
> my.model <- lm(relief ~ soma * juice, data=my.data)
which builds a linear regression formula of the following form:
relief = β₀ + β₁·soma + β₂·juice + β₃·soma·juice
where a non-zero β₃ indicates an interaction effect between the two predictors: a positive β₃ models the synergy just described, while a negative β₃ (with β₁ and β₂ positive) would suggest that the grapefruit juice blocks the effect of Soma (and vice versa).
Bayesian linear regression: Bayesian linear regression is an alternative approach to the preceding methods that offers a lot of compelling benefits. One of the major benefits of Bayesian linear regression—which echoes the benefits of Bayesian methods as a whole—is that we obtain a posterior distribution of credible values for each of the beta coefficients. This makes it easy to make probabilistic statements about intervals in which the population coefficient is likely to lie. This makes hypothesis testing very easy.
Another major benefit is that we are no longer held hostage to the assumption that the residuals are normally distributed. If you were the good person you lay claim to being on your online dating profiles, you would have done the exercises at the end of the last chapter. If so, you would have seen how we could use the t-distribution to make our models more robust to the influence of outliers. In Bayesian linear regression, it is easy to use a t-distributed likelihood function to describe the distribution of the residuals. Lastly, by adjusting the priors on the beta coefficients and making them sharply peaked at zero, we achieve a certain amount of shrinkage regularization for free, and build models that are inherently resistant to overfitting.
Exercises
Practice the following exercises to revise the concepts learned thus far:
By far, the best way to become comfortable with and learn the ins-and-outs of applied regression analysis is to actually carry out regression analyses. To this end, you can use some of the many datasets that are included in R. To get a full listing of the datasets in the datasets package, execute the following:
> help(package="datasets")
There are hundreds of more datasets spread across the other several thousand R packages. Even better, load your own datasets, and attempt to model them.
Examine and plot the dataset pressure, which describes the relationship between the vapor pressure of mercury and temperature. What assumption of linear regression does this violate? Attempt to model this using linear regression by using temperature squared as a predictor, like this:
> lm(pressure ~ I(temperature^2), data=pressure)
Compare the fit between the model that uses the non-squared temperature and this one. Explore cubic and quartic relationships between temperature and pressure. How accurately can you predict pressure? Employ cross-validation to make sure that no overfitting has occurred. Marvel at how nicely physics plays with statistics sometimes, and wish that the behavioral sciences would behave better.
Keep an eye out for provocative news and human-interest stories or popular culture anecdotes that claim suspect causal relationships like gum chewing causes heart disease or dark chocolate promotes weight loss. If these claims were backed up using data from natural experiments, try to think of potential confounding variables that invalidate the claim. Impress upon your friends and family that the media is trying to take advantage of their gullibility and non-fluency in the principles of statistics. As you become more adept at recognizing suspicious claims, you’ll be invited to fewer and fewer parties. This will clear up your schedule for more studying.
To what extent can Mikhail Gorbachev’s revisionism of late Stalinism be viewed as a precipitating factor in the fall of the Berlin Wall? Exceptional responses will address the effects of Western interpretations of Marx on the post-war Soviet intelligentsia.
Summary
Whew, we’ve been through a lot in this chapter, and I commend you for sticking it out. Your tenacity will be well rewarded when you start using regression analysis in your own projects or research like a professional.
We started off with the basics: how to describe a line, simple linear relationships, and how a best-fit regression line is determined. You saw how we can use R to easily plot these best-fit lines.
We went on to explore regression analysis with more than one predictor. You learned how to interpret the loquacious lm summary output, and what everything meant. In the context of multiple regression, you learned how the coefficients are properly interpreted as the effect of a predictor controlling for all other predictors. You’re now aware that controlling for and thinking about confounds is one of the cornerstones of statistical thinking.
We discovered that we weren’t limited to using continuous predictors, and that, using dummy coding, we can not only model the effects of categorical variables, but also replicate the functionality of the two-sample t-test and one-way ANOVA.
Youlearnedofthehazardsofgoinghog-wildandincludingallavailablepredictorsina
linearmodel.Specifically,you’vecometofindoutthatrecklesspursuitofR^2
maximizationisalosingstrategywhenitcomestobuildinginterpretable,generalizable,
andusefulmodels.You’velearnedthatitisfarbettertominimizeout-of-sampleerror
usingestimatesfromcrossvalidation.Weframedthispreferencefortesterror
minimizationoftrainingerrorminimizationintermsofthebias-variancetradeoff.
Penultimately,youlearnedthestandardassumptionsoflinearregressionandtouchedupon
somewaystodeterminewhetherourassumptionshold.Youcametounderstandthat
regressiondiagnosticsisn’tanexactscience.
Lastly,youlearnedthatthere’smuchwehaven’tlearnedaboutregressionanalysis.This
willkeepushumbleandhungryformoreknowledge.
Chapter 9. Predicting Categorical Variables
Our first foray into predictive analytics began with regression techniques for predicting continuous variables. In this chapter, we will be discussing a perhaps even more popular class of techniques from statistical learning known as classification.
All these techniques have at least one thing in common: we train a learner on input, for which the correct classifications are known, with the intention of using the trained model on new data whose class is unknown. In this way, classification is a set of algorithms and methods to predict categorical variables.
Whether you know it or not, statistical learning algorithms performing classification are all around you. For example, if you've ever accidentally checked the Spam folder of your e-mail and been horrified, you can thank your lucky stars that there are sophisticated classification mechanisms that your e-mail is run through to automatically mark spam as such so you don't have to see it. On the other hand, if you've ever had a legitimate e-mail sent to spam, or a spam e-mail sneak past the spam filter into your inbox, you've witnessed the limitations of classification algorithms firsthand: since the e-mails aren't being audited by a human one-by-one, and are being audited by a computer instead, misclassification happens. Just like our linear regression predictions differed from our training data to varying degrees, so too do classification algorithms make mistakes. Our job is to make sure we build models that minimize these misclassifications—a task which is not always easy.
There are many different classification methods available in R; we will be learning about four of the most popular ones in this chapter—starting with k-Nearest Neighbors.
k-Nearest Neighbors
You're at a train terminal looking for the right line to stand in to get on the train from Upstate NY to Penn Station in NYC. You've settled into what you think is the right line, but you're still not sure because it's so crowded and chaotic. Not wanting to wait in the wrong line, you turn to the person closest to you and ask them where they're going: "Penn Station," says the stranger, blithely.
You decide to get some second opinions. You turn to the second closest person and the third closest person and ask them separately: Penn Station and Nova Scotia, respectively. The general consensus seems to be that you're in the right line, and that's good enough for you.
If you've understood the preceding interaction, you already understand the idea behind k-Nearest Neighbors (k-NN hereafter) on a fundamental level. In particular, you've just performed k-NN, where k=3. Had you just stopped at the first person, you would have performed k-NN, where k=1.
So, k-NN is a classification technique that, for each data point we want to classify, finds the k closest training data points and returns the consensus. In traditional settings, the most common distance metric is Euclidean distance (which, in two dimensions, is equal to the distance from point a to point b given by the Pythagorean Theorem). Another common distance metric is Manhattan distance, which, in two dimensions, is equal to the sum of the lengths of the legs of the triangle connecting two data points.
Figure 9.1: Two points on a Cartesian plane. Their Euclidean distance is 5. Their Manhattan distance is 3+4=7
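Both metrics are easy to compute by hand in R. A minimal sketch, using two hypothetical points chosen to form the 3-4-5 right triangle from Figure 9.1:

```r
# Two points on a Cartesian plane, forming a 3-4-5 right triangle
a <- c(0, 0)
b <- c(3, 4)

# Euclidean distance: square root of the sum of squared differences
euclidean <- sqrt(sum((a - b)^2))
euclidean   # 5

# Manhattan distance: sum of the absolute differences (the triangle's legs)
manhattan <- sum(abs(a - b))
manhattan   # 7

# Base R's dist() function implements both metrics
dist(rbind(a, b), method="euclidean")
dist(rbind(a, b), method="manhattan")
```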
k-Nearest Neighbors is a bit of an oddball technique; most statistical learning methods attempt to impose a particular model on the data and estimate the parameters of that model. Put another way, the goal of most learning methods is to learn an objective function that maps inputs to outputs. Once the objective function is learned, there is no longer a need for the training set.
In contrast, k-NN learns no such objective function. Rather, it lets the data speak for themselves. Since there is no actual learning, per se, going on, k-NN needs to hold on to the training dataset for future classifications. This also means that the training step is instantaneous, since there is no training to be done. Most of the time spent during the classification of a data point is spent finding its nearest neighbors. This property of k-NN makes it a lazy learning algorithm.
Since no particular model is imposed on the training data, k-NN is one of the most flexible and accurate classification learners there are, and it is very widely used. With great flexibility, though, comes great responsibility—it is our responsibility to ensure that k-NN hasn't overfit the training data.
Figure 9.2: The species classification regions of the iris dataset using 1-NN
In Figure 9.2, we use the built-in iris dataset. This dataset contains four continuous measurements of iris flowers and maps each observation to one of three species: iris setosa (the square points), iris virginica (the circular points), and iris versicolor (the triangular points). In this example, we use only two of the available four attributes in our classification for ease of visualization: sepal width and petal width. As you can see, each species seems to occupy its own little space in our 2-D feature space. However, there seems to be a little overlap between the versicolor and virginica data points. Because this classifier is using only one nearest neighbor, there appear to be small regions of training-data-specific idiosyncratic classification behavior where virginica is encroaching on the versicolor classification region. This is what it looks like when our k-NN overfits the data. In our train station metaphor, this is tantamount to asking only one neighbor what line you're on and the misinformed (or malevolent) neighbor telling you the wrong answer.
k-NN classifiers that have overfit have traded low variance for low bias. It is common for overfit k-NN classifiers to have a 0% misclassification rate on the training data, but small changes in the training data harshly change the classification regions (high variance). Like with regression (and the rest of the classifiers we'll be learning about in this chapter), we aim to find the optimal point in the bias-variance tradeoff—the one that minimizes error in an independent testing set, and not one that minimizes training set misclassification error. We do this by modifying the k in k-NN and using the consensus of more neighbors. Beware—if you ask too many neighbors, you start to take the answers of rather distant neighbors seriously, and this can also adversely affect accuracy. Finding the "sweet spot", where k is neither too small nor too large, is called hyperparameter optimization (because k is called a hyperparameter of k-NN).
Figure 9.3: The species classification regions of the iris dataset using 15-NN. The boundaries between the classification regions are now smoother and less overfit
Compare Figure 9.2 to Figure 9.3, which depicts the classification regions of the iris classification task using 15 nearest neighbors. The aberrant virginicas are no longer carving out their own territory in versicolor's region, and the boundaries between the classification regions (also called decision boundaries) are now smoother—often a trait of classifiers that have found the sweet spot in the bias-variance tradeoff. One could imagine that new training data will no longer have such a drastic effect on the decision boundaries—at least not as much as with the 1-NN classifier.
Note
In the iris flower example, and the next example, we deal with continuous predictors only. k-NN can handle categorical variables, though—not unlike how we dummy coded categorical variables in linear regression in the last chapter! Though we didn't talk about how, regression (and k-NN) handles non-binary categorical variables, too. Can you think of how this is done? Hint: we can't use just one dummy variable for a non-binary categorical variable, and the number of dummy variables needed is one less than the number of categories.
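To see the hint in action, R's model.matrix function shows how a non-binary categorical variable gets expanded into dummy variables. A quick sketch using the iris species factor (three categories, so two dummies):

```r
# iris$Species is a factor with three levels
levels(iris$Species)
# "setosa" "versicolor" "virginica"

# model.matrix expands a factor into dummy (indicator) columns;
# a three-level factor needs only two dummies (plus the intercept),
# because the third category is implied when both dummies are 0
head(model.matrix(~ Species, data=iris))
```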
Using k-NN in R
The dataset we will be using for all the examples in this chapter is the PimaIndiansDiabetes dataset from the mlbench package. This dataset is part of the data collected from one of the numerous diabetes studies on the Pima Indians, a group of indigenous Americans who have among the highest prevalence of Type II diabetes in the world—probably due to a combination of genetic factors and their relatively recent introduction to a heavily processed Western diet. For 768 observations, it has nine attributes, including skinfold thickness, BMI, and so on, and a binary variable representing whether the patient had diabetes. We will be using the eight predictor variables to train a classifier to predict whether a patient has diabetes or not.
This dataset was chosen because it has many observations available, has a goodly amount of predictor variables available, and it is an interesting problem. Additionally, it is not unlike many other medical datasets that have a few predictors and a binary class outcome (for example, alive/dead, pregnant/not-pregnant, benign/malignant). Finally, unlike many classification datasets, this one has a good mixture of both class outcomes; this contains 35% diabetes-positive observations. Grievously imbalanced datasets can cause a problem with some classifiers and impair our accuracy estimates.
To get this dataset, we are going to run the following commands to install the necessary packages, load the data, and give the dataset a new name that is faster to type:
> # "class" is one of the packages that implement k-NN
> # "chemometrics" contains a function we need
> # "mlbench" holds the dataset
> install.packages(c("class", "mlbench", "chemometrics"))
> library(class)
> library(mlbench)
> data(PimaIndiansDiabetes)
> PID <- PimaIndiansDiabetes
Now, let's divide our dataset into a training set and a testing set using an 80/20 split.
> # we set the seed so that our splits are the same
> set.seed(3)
> ntrain <- round(nrow(PID)*4/5)
> train <- sample(1:nrow(PID), ntrain)
> training <- PID[train,]
> testing <- PID[-train,]
Now we have to choose how many nearest neighbors we want to use. Luckily, there's a great function called knnEval from the chemometrics package that will allow us to graphically visualize the effectiveness of k-NN with different values of k using cross-validation. Our objective measure of effectiveness will be the misclassification rate, or, the percent of testing observations that are misclassified.
> resknn <- knnEval(scale(PID[,-9]), PID[,9], train, kfold=10,
+                   knnvec=seq(1, 50, by=1),
+                   legpos="bottomright")
There's a lot here to explain! The first three arguments are the predictor matrix, the variable to predict, and the indices of the training dataset, respectively. Note that the ninth column of the PID data frame holds the class labels—to get a matrix containing just the predictors, we can remove the ninth column by using a negative column index. The scale function that we call on the predictor matrix subtracts each column's mean from each value in that column and divides each value by its respective column's standard deviation—it converts each value to a z-score! This is usually important in k-NN in order for the distances between data points to be meaningful. For example, the distance between data points would change drastically if a column previously measured in meters were re-represented as millimeters. The scale function puts all the features in comparable ranges regardless of the original units.
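You can verify what scale does on a toy example; after scaling, a column's z-scores are the same regardless of its original units:

```r
# A toy predictor measured in meters, and the same values in millimeters
heights.m  <- c(1.5, 1.6, 1.7, 1.8)
heights.mm <- heights.m * 1000

# Raw spreads differ wildly between the two unit choices...
diff(range(heights.m))    # 0.3
diff(range(heights.mm))   # 300

# ...but after converting to z-scores, the two columns are identical
z.m  <- scale(heights.m)
z.mm <- scale(heights.mm)
all.equal(as.vector(z.m), as.vector(z.mm))  # TRUE

# Scaled columns have mean 0 and standard deviation 1
round(mean(z.m), 10)  # 0
sd(z.m)               # 1
```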
Note that for the third argument, we are not supplying the function with the training dataset, but the indices that we used to construct the training dataset. If you are confused, inspect the various objects we have in our workspace with the head function.
The final three arguments indicate that we want to use 10-fold cross-validation, check every value of k from 1 to 50, and put the legend in the lower-right corner of the plot.
The plot that this code produces is shown in Figure 9.4:
Figure 9.4: A plot illustrating test set error, cross-validated error, and training set error as a function of k in k-NN. After about k=15, the test and CV error doesn't appear to change much
As you can see from the preceding plot, after about k=15, the test and cross-validated misclassification error don't seem to change much. Using k=27 seems like a safe bet, as measured by the minimization of CV error.
Note
To see what it looks like when we underfit and use too many neighbors, check out Figure 9.5, which expands the x-axis of the last figure to show the misclassification error of using up to 200 neighbors. Notice that the test and CV error start off high (at 1-NN) and quickly decrease. At about 70-NN, though, the test and CV error start to rise steadily as the classifier underfits. Note also that the training error starts out at 0 for 1-NN (as we would expect), but quickly increases as we add more neighbors. This is a good reminder that our goal is not to minimize the training set error but to minimize error on an independent dataset—either a test set or an estimate using cross-validation.
Figure 9.5: A plot illustrating test set error, cross-validated error, and training set error as a function of k in k-NN, up to k=200. Notice how error increases as the number of neighbors becomes too large and causes the classifier to underfit.
Let's perform the k-NN!
> predictions <- knn(scale(training[,-9]),
+                    scale(testing[,-9]),
+                    training[,9], k=27)
>
> # function to give correct classification rate
> accuracy <- function(predictions, answers){
+   sum((predictions==answers)/(length(answers)))
+ }
>
> accuracy(predictions, testing[,9])
[1] 0.7597403
It looks like using 27-NN gave us a correct classification rate of 76% (a misclassification rate of 100% - 76% = 24%). Is that good? Well, let's put it in perspective.
If we randomly guessed whether each testing observation was positive for diabetes, we would expect a classification rate of 50%. But remember that the number of non-diabetes observations outnumbers the number of observations of diabetes (non-diabetes observations are 65% of the total). So, if we built a classifier that just predicted no diabetes for every observation, we would expect a 65% correct classification rate. Luckily, our classifier performs significantly better than our naïve classifier, although, perhaps, not as well as we would have hoped. As we'll learn as the chapter moves on, k-NN is competitive with the accuracy of other classifiers—I guess it's just a really hard problem!
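You can check the 65% baseline directly from the class proportions. A quick sketch (assuming the PID data frame from earlier is loaded):

```r
# Proportion of each class in the full dataset
prop.table(table(PID$diabetes))
# neg is about 0.65, pos about 0.35

# A "classifier" that always predicts the majority class ("neg")
# would therefore be correct about 65% of the time
naive.accuracy <- max(prop.table(table(PID$diabetes)))
naive.accuracy
```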
Confusion matrices
We can get a more detailed look at our classifier's accuracy via a confusion matrix. You can get R to give up a confusion matrix with the following command:
> table(testing[,9], predictions)
     predictions
      neg pos
  neg  86   9
  pos  28  31
The columns in this matrix represent our classifier's predictions; the rows represent the true classifications of our testing set observations. If you recall from Chapter 3, Describing Relationships, this means that the confusion matrix is a cross-tabulation (or contingency table) of our predictions and the actual classifications. The cell in the top-left corner represents observations that didn't have diabetes that we correctly predicted as non-diabetic (true negatives). In contrast, the cell in the lower-right corner represents true positives. The upper-right cell contains the count of false positives, observations that we incorrectly predicted as having diabetes. Finally, the remaining cell holds the number of false negatives, of which there are 28.
This is helpful for examining whether there is a class that we are systematically misclassifying or whether our false negatives and false positives are significantly imbalanced. Additionally, there are often different costs associated with false negatives and false positives. For example, in this case, the cost of misclassifying a patient as non-diabetic is great, because it impedes our ability to help a truly diabetic patient. In contrast, misclassifying a non-diabetic patient as diabetic, although not ideal, incurs a far less grievous cost. A confusion matrix lets us view, at a glance, just what types of errors we are making. For k-NN, and the other classifiers in this chapter, there are ways to specify the cost of each type of misclassification in order to produce a classifier optimized for a particular cost-sensitive domain, but that is beyond the scope of this book.
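From the four cells of the matrix above, we can also compute two commonly reported error summaries: the true positive rate (how many diabetics we caught) and the true negative rate (how many non-diabetics we correctly cleared). A sketch using the counts from our confusion matrix:

```r
# Counts from the confusion matrix above
TN <- 86; FP <- 9; FN <- 28; TP <- 31

# True positive rate: proportion of actual positives we detected
TP / (TP + FN)   # about 0.53

# True negative rate: proportion of actual negatives we detected
TN / (TN + FP)   # about 0.91

# Overall accuracy, matching the 76% figure from before
(TP + TN) / (TP + TN + FP + FN)   # about 0.76
```

The asymmetry here (we catch 91% of non-diabetics but only 53% of diabetics) is exactly the kind of imbalance, between the costly and the less costly errors, that the confusion matrix makes visible.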
Limitations of k-NN
Before we move on, we should talk about some of the limitations of k-NN.
First, if you're not careful to use an optimized implementation of k-NN, classification can be slow, since it requires the calculation of the test data point's distance to every other data point; sophisticated implementations have mechanisms for partially handling this.
Second, vanilla k-NN can perform poorly when the number of predictor variables becomes too large. In the iris example, we used only two predictors, which can be plotted in two-dimensional space where the Euclidean distance is just the 2-D Pythagorean theorem that we learned in middle school. A classification problem with n predictors is represented in n-dimensional space; the Euclidean distance between two points in high-dimensional space can be very large, even if the data points are similar. This, and other complications that arise from predictive analytics techniques using a high-dimensional feature space, is, colloquially, known as the curse of dimensionality. It is not uncommon for medical, image, or video data to have hundreds or even thousands of dimensions. Luckily, there are ways of dealing with these situations. But let's not dwell there.
Logistic regression
Remember when I said, in the previous chapter, that a thorough understanding of linear models will pay enormous dividends throughout your career as an analyst? Well, I wasn't lying! This next classifier is a product of a generalization of linear regression that can act as a classifier.
What if we used linear regression on a binary outcome variable, representing diabetes as 1 and not diabetes as 0? We know that the output of linear regression is a continuous prediction, but what if, instead of predicting the binary class (diabetes or not diabetes), we attempted to predict the probability of an observation having diabetes? So far, the idea is to train a linear regression on a training set where the variable we are trying to predict is a dummy-coded 0 or 1, and the predictions on an independent testing set are interpreted as a continuous probability of class membership.
It turns out this idea is not quite as crazy as it sounds—the outcomes of the predictions are indeed proportional to the probability of each observation's class membership. The biggest problem is that the outcome is only proportional to the class membership probability and can't be directly interpreted as a true probability. The reason is simple: probability is, indeed, a continuous measurement, but it is also a constrained measurement—it is bounded by 0 and 1. With regular old linear regression, we will often get predicted outcomes below 0 and above 1, and it is unclear how to interpret those outcomes.
But what if we had a way of taking the outcome of a linear regression (a linear combination of beta coefficients and predictors) and applying a function to it that constrains it to be between 0 and 1 so that it can be interpreted as a proper probability? Luckily, we can do this with the logistic function:
y = 1 / (1 + e^(-x))
whose plot is depicted in Figure 9.6:
Figure 9.6: The logistic function
Note that no matter what value of x (the output of the linear regression) we use—from negative infinity to positive infinity—the y (the output of the logistic function) is always between 0 and 1. Now we can adapt linear regression to output probabilities!
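You can reproduce a plot like Figure 9.6 yourself with a few lines of base R:

```r
# The logistic function maps any real number into the interval (0, 1)
logistic <- function(x) 1 / (1 + exp(-x))

# Even extreme inputs stay strictly between 0 and 1
logistic(-100)  # practically 0
logistic(0)     # 0.5
logistic(100)   # practically 1

# Plot the characteristic S-shaped curve over a range of inputs
curve(logistic, from=-6, to=6, ylab="y", main="The logistic function")
```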
The function that we apply to the linear combination of predictors to change it into the kind of prediction we want is called the inverse link function. The function that transforms the dependent variable into a value that can be modeled using linear regression is just called the link function. In logistic regression, the link function (which is the inverse of the inverse link function, the logistic function) is called the logit function.
Before we get started using this powerful idea on our data, there are two other problems that we must contend with. The first is that we can't use ordinary least squares to solve for the coefficients anymore, because the link function is non-linear. Most statistical software solves this problem using a technique called Maximum Likelihood Estimation (MLE) instead, though there are other alternatives.
The second problem is that an assumption of linear regression (if you remember from the last chapter) is that the error distribution is normally distributed. For a binary dependent variable, this doesn't make sense. So, logistic regression models the error distribution as a Bernoulli distribution (or a binomial distribution, depending on how you look at it).
Note
Generalized Linear Model (GLM)
If you are surprised that linear regression can be generalized enough to accommodate classification, prepare to be astonished by generalized linear models!
GLMs are a generalization of regular linear regression that allow for other link functions to map from linear model output to the dependent variable, and other error distributions to describe the residuals. In logistic regression, the link function and error distribution are the logit and binomial, respectively. In regular linear regression, the link function is the identity function (a function that returns its argument unchanged), and the error distribution is the normal distribution.
Besides regular linear regression and logistic regression, there are still other species of GLM that use other link functions and error distributions. Another common GLM is Poisson regression, a technique that is used to predict/model count data (number of traffic stops, number of red cards, and so on), which uses the logarithm as the link function and the Poisson distribution as its error distribution. The use of the log link function constrains the response variable (the dependent variable) so that it is always above 0.
Remember that we expressed the t-test and ANOVA in terms of the linear model? So the GLM encompasses not only linear regression, logistic regression, Poisson regression, and the like, but it also encompasses t-tests, ANOVA, and the related technique called ANCOVA (Analysis of Covariance). Pretty cool, eh?!
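To make the Poisson case concrete, here is a sketch of fitting a Poisson regression on simulated count data (the data here is made up purely for illustration):

```r
# Simulate count data whose log-mean rises with a predictor x
set.seed(1)
x <- runif(100, 0, 2)
counts <- rpois(100, lambda=exp(0.5 + 1*x))

# Poisson regression: log link, Poisson error distribution
pois.model <- glm(counts ~ x, family=poisson(link="log"))
summary(pois.model)

# Because of the log link, predictions on the response scale
# are always strictly positive
all(predict(pois.model, type="response") > 0)  # TRUE
```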
Using logistic regression in R
Performing logistic regression—an advanced and widely used classification method—could scarcely be easier in R. To fit a logistic regression, we use the familiar glm function. The difference now is that we'll be specifying our own error distribution and link function (the glm calls of the last chapter assumed we wanted the regular linear regression error distribution and link function, by default). These are specified in the family argument:
> model <- glm(diabetes ~ ., data=PID, family=binomial(logit))
Here, we build a logistic regression using all available predictor variables.
You may also see logistic regressions being performed where the family argument looks like family="binomial" or family=binomial()—it's all the same thing, I just like being more explicit.
Let's look at the output from calling summary on the model.
> summary(model)

Call:
glm(formula = diabetes ~ ., family = binomial(logit), data = PID)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.5566  -0.7274  -0.4159   0.7267   2.9297

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.4046964  0.7166359 -11.728  < 2e-16 ***
pregnant     0.1231823  0.0320776   3.840 0.000123 ***
glucose      0.0351637  0.0037087   9.481  < 2e-16 ***
pressure    -0.0132955  0.0052336  -2.540 0.011072 *
...
The output is similar to that of regular linear regression; for example, we still get estimates of the coefficients and associated p-values. The interpretation of the beta coefficients requires a little more care this time around, though. The beta coefficient of pregnant, 0.123, means that a one unit increase in pregnant (an increase in the number of times being pregnant by one) is associated with an increase of 0.123 in the logarithm of the odds of the observation being diabetic. If this is confusing, concentrate on the fact that if the coefficient is positive, it has a positive impact on the probability of the dependent variable, and if the coefficient is negative, it has a negative impact on the probability of the binary outcome. Whether positive means a higher probability of diabetes or a higher probability of not diabetes depends on how your binary dependent variable is dummy-coded.
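Since the raw coefficients are on the log-odds scale, a common trick is to exponentiate them to get odds ratios, which are a bit easier to talk about. A sketch (assuming the model object fit above):

```r
# Exponentiating a log-odds coefficient yields an odds ratio
exp(coef(model)["pregnant"])
# about 1.13: each additional pregnancy multiplies the odds of
# diabetes by roughly 1.13, holding the other predictors constant

# An odds ratio above 1 corresponds to a positive coefficient,
# below 1 to a negative one
exp(coef(model)["pressure"])  # slightly below 1
```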
To find the training set accuracy of our model, we can use the accuracy function we wrote in the last section. In order to use it correctly, though, we need to convert the probabilities into class labels, as follows:
> predictions <- round(predict(model, type="response"))
> predictions <- ifelse(predictions==1, "pos", "neg")
> accuracy(predictions, PID$diabetes)
[1] 0.7825521
Cool, we get a 78% accuracy on the training data, but remember: if we overfit, our training set accuracy will not be a reliable estimate of performance on an independent dataset. In order to test this model's generalizability, let's perform k-fold cross-validation, just like in the previous chapter!
> set.seed(3)
> library(boot)
> cv.err <- cv.glm(PID, model, K=5)
> cv.err$delta[2]
[1] 0.154716
> 1 - cv.err$delta[2]
[1] 0.845284
Wow, our CV-estimated accuracy rate is 85%! This indicates that it is highly unlikely that we are overfitting. If you are wondering why we were using all available predictors after I said that doing so was dangerous business in the last chapter, it's because though they do make the model more complex, the extra predictors didn't cause the model to overfit.
Finally, let's test the model on the independent test set so that we can compare this model's accuracy against k-NN's:
> predictions <- round(predict(model, type="response",
+                              newdata=testing))
> predictions <- ifelse(predictions==1, "pos", "neg")
> accuracy(predictions, testing[,9])  # 78%
[1] 0.7792208
Nice! A 78% accuracy rate!
It looks like logistic regression may have given us a slight improvement over the more flexible k-NN. Additionally, the model gives us at least a little transparency into why each observation is classified the way it is—a luxury not available to us via k-NN.
Before we move on, it's important to discuss two limitations of logistic regression.
The first is that logistic regression proper does not handle non-binary categorical variables—variables with more than two levels. There exists a generalization of logistic regression, called multinomial regression, that can handle this situation, but this is vastly less common than logistic regression. It is, therefore, more common to see another classifier being used for a non-binary classification problem.
The last limitation of logistic regression is that it results in a linear decision boundary. This means that if a binary outcome is not easily separated by a line, plane, or hyperplane, then logistic regression may not be the best route. May in the previous sentence is italicized because there are tricks you can use to get logistic regression to spit out a non-linear decision boundary—sometimes, a high-performing one—as we'll see in the section titled Choosing a classifier.
Decision trees
We now move on to one of the most easily interpretable and most popular classifiers there are out there: the decision tree. Decision trees—which look like an upside-down tree with the trunk on top and the leaves on the bottom—play an important role in situations where classification decisions have to be transparent and easily understood and explained. They also handle both continuous and categorical predictors, outliers, and irrelevant predictors rather gracefully. Finally, the general ideas behind the algorithms that create decision trees are quite intuitive, though the details can sometimes get hairy.
Figure 9.7 depicts a simple decision tree designed to classify motor vehicles into motorcycles, golf carts, or sedans.
Figure 9.7: A simple and illustrative decision tree that classifies motor vehicles into motorcycles, golf carts, and sedans
This is a rather simple decision tree with only three leaves (terminal nodes) and two decision points. Note that the first decision point is (a) on a binary categorical variable, and (b) results in one terminal node, motorcycle. The other branch contains the other decision point, a continuous variable with a split point. This split point was chosen carefully by the decision-tree-creating algorithm to result in the most informative split—the one that best classifies the rest of the observations as measured by the misclassification rate of the training data.
Note
Actually, in most cases, the decision-tree-creating algorithm doesn't choose a split that results in the lowest misclassification rate of the training data, but chooses one that minimizes either the Gini coefficient or the cross-entropy of the remaining training observations. The reasons for this are two-fold: (a) both the Gini coefficient and cross-entropy have mathematical properties that make them more easily amenable to numerical optimization, and (b) it generally results in a final tree with less bias.
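For the curious, both impurity measures are simple to compute for a node; a sketch using the standard formulas (the helper function names are my own):

```r
# Gini impurity of a node: sum over classes of p * (1 - p)
gini <- function(p) sum(p * (1 - p))

# Cross-entropy of a node: -sum over classes of p * log(p)
cross.entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))

# A pure node (all one class) has zero impurity by both measures
gini(c(1, 0))           # 0
cross.entropy(c(1, 0))  # 0

# A maximally mixed two-class node is the worst case
gini(c(0.5, 0.5))           # 0.5
cross.entropy(c(0.5, 0.5))  # log(2), about 0.693
```

A candidate split is scored by the (weighted) impurity of the child nodes it produces; lower is better.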
The overall idea of the decision-tree-growing algorithm, recursive splitting, is simple:
1. Step 1: Choose a variable and split point that results in the best classification outcomes.
2. Step 2: For each of the resulting branches, check to see if some stopping criterion is met. If so, leave it alone. If not, move on to the next step.
3. Step 3: Repeat Step 1 on the branches that do not meet the stopping criteria.
The stopping criterion is usually either a certain depth, which the tree cannot grow past, or a minimum number of observations, below which a leaf node cannot be split further. Both of these are hyperparameters (also called tuning parameters) of the decision tree algorithm—just like the k in k-NN—and must be fiddled with in order to achieve the best possible decision tree for classifying an independent dataset.
A decision tree, if not kept in check, can grossly overfit the data—returning an enormous and complicated tree with a minimum leaf node size of 1—resulting in a nearly bias-less classification mechanism with prodigious variance. To prevent this, either the tuning parameters must be chosen carefully, or a huge tree can be built and cut down to size afterward. The latter technique is generally preferred and is, quite appropriately, called pruning. The most common pruning technique is called cost complexity pruning, where complex parts of the tree that provide little in the way of classification power, as measured by improvement of the final misclassification rate, are cut down and removed.
Enough theory—let's get started! First, we'll grow a full tree using the PID dataset and plot the result:
> library(tree)
> our.big.tree <- tree(diabetes ~ ., data=training)
> summary(our.big.tree)

Classification tree:
tree(formula = diabetes ~ ., data = training)
Variables actually used in tree construction:
[1] "glucose"  "age"      "mass"     "pedigree" "triceps"  "pregnant"
[7] "insulin"
Number of terminal nodes:  16
Residual mean deviance:  0.7488 = 447.8 / 598
Misclassification error rate: 0.184 = 113 / 614
> plot(our.big.tree)
> text(our.big.tree)
The resulting plot is depicted in Figure 9.8.
Figure 9.8: An unpruned and complex decision tree
The power of a decision tree—which is usually not competitive with other classification mechanisms, accuracy-wise—is that the representation of the decision rules is transparent, easy to visualize, and easy to explain. This tree is rather large and unwieldy, which hinders its ability to be understood (or memorized) at a glance. Additionally, for all its complexity, it only achieves an 81% accuracy rate on the training data (as reported by the summary function).
We can (and will) do better! Next, we will be investigating the optimal size of the tree by employing cross-validation, using the cv.tree function.
> set.seed(3)
> cv.results <- cv.tree(our.big.tree, FUN=prune.misclass)
> plot(cv.results$size, cv.results$dev, type="b")
In the preceding code, we are telling the cv.tree function that we want to prune our tree using the misclassification rate as our objective metric. Then, we are plotting the CV error rate (dev) as a function of tree size (size).
Figure 9.9: A plot of cross-validated misclassification error as a function of tree size. Observe that a tree of size one performs terribly, and that the error rate steeply declines before rising slightly as the tree overfits at large sizes
As you can see from the output (shown in Figure 9.9), the optimal size (number of terminal nodes) of the tree seems to be five. However, a tree of size three is not terribly less performant than a tree of size five; so, for ease of visualization, interpretation, and memorization, we will be using a final tree with three terminal nodes. To actually perform the pruning, we will be using the prune.misclass function, which takes the size of the tree as an argument.
> pruned.tree <- prune.misclass(our.big.tree, best=3)
> plot(pruned.tree)
> text(pruned.tree)
> # let's test its accuracy
> pruned.preds <- predict(pruned.tree, newdata=testing, type="class")
> accuracy(pruned.preds, testing[,9])  # 71%
[1] 0.7077922
The final tree is depicted in Figure 9.10.
Figure 9.10: A simpler decision tree with the same testing set performance as the tree in Figure 9.8
Rad! A tree so simple it can be easily memorized by medical personnel, and it achieves the same testing-set accuracy as the unwieldy tree in Figure 9.8: 71%! Now, the accuracy rate, by itself, is nothing to write home about, particularly because the naïve classifier achieves a 65% accuracy rate. Nevertheless, the fact that a significantly better classifier can be built from two simple rules—closely following the logic physicians employ, anyway—is where decision trees have a huge leg up relative to other techniques. Further, we could have bumped up this accuracy rate with more samples and more careful hyperparameter tuning.
Random forests
The final classifier that we will be discussing in this chapter is the aptly named Random Forest, and it is an example of a meta-technique called ensemble learning. The idea and logic behind random forests follows thusly:
Given that (unpruned) decision trees can be nearly bias-less, high-variance classifiers, a method of reducing variance at the cost of a marginal increase in bias could greatly improve upon the predictive accuracy of the technique. One salient approach to reducing the variance of decision trees is to train a bunch of unpruned decision trees on different random subsets of the training data, sampling with replacement—this is called bootstrap aggregating, or bagging. At the classification phase, the test observation is run through all of these trees (a forest, perhaps?), and each resulting classification casts a vote for the final classification of the whole forest. The class with the highest number of votes is the winner. It turns out that the consensus among many high-variance trees on bootstrapped subsets of the training data results in a significant accuracy improvement and vastly decreased variance.
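The bootstrap-sampling half of this recipe can be illustrated in a few lines. A conceptual sketch (not the actual internals of any forest implementation), using the built-in iris data for self-containment:

```r
# Bagging, conceptually: each tree sees a bootstrap sample of the rows
set.seed(3)
n <- nrow(iris)

# Sampling WITH replacement: some rows appear multiple times,
# and some rows do not appear at all
boot.indices <- sample(1:n, n, replace=TRUE)

# On average only about 63% of the rows end up in any one sample;
# the rest are "out of bag" for that tree
length(unique(boot.indices)) / n

# Each of, say, 500 trees would be grown on its own bootstrap sample,
# and the final classification is a majority vote across the trees
```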
Note
Très bien, ensemble!
Bagging is one example of an ensemble method—a meta-technique that uses multiple classifiers to improve predictive accuracy. Nearly bias-less/high-variance classifiers are the ones that seem to benefit the most from ensemble methods. Additionally, ensemble methods are easiest to use with classifiers that are created and trained rapidly, since the method ipso facto relies on a large number of them. Decision trees fit all of these characteristics, and this accounts for why bagged trees and random forests are the most common ensemble learning instruments.
So far, what we have chronicled describes a technique called bagged trees. But random forests have one more trick up their sleeves! Observing that the variance can be further reduced by forcing the trees to be less similar, random forests differ from bagged trees by forcing each tree to use only a subset of its available predictors to split on in the growing phase.
Many people are initially confused as to how deliberately reducing the efficacy of the component trees can possibly result in a more accurate ensemble. To clear this up, consider that a few very influential predictors will dominate the expression of the trees, even if the subsets contain little overlap. By constraining the number of predictors a tree can use at each splitting phase, a more diverse crop of trees is built. This results in a forest with lower variance than a forest with no constraints.
Random forests are the modern darling of classifiers—and for good reason. For one, they are often extraordinarily accurate. Second, since random forests use only two hyper-parameters (the number of trees to use in the forest and the number of predictors to use at each step of the splitting process), they are very easy to create, and require little in the way of hyper-parameter tuning. Third, it is extremely difficult for a random forest to overfit, and it doesn't happen very often at all, in practice. For example, increasing the number of trees that make up the forest does not cause the forest to overfit, and fiddling with the number-of-predictors hyper-parameter can't possibly result in a forest with a higher variance than that of the component tree that overfits the most.
One last awesome property of the random forest is that the training error rate that it reports is a nearly unbiased estimator of the cross-validated error rate. This is because the training error rate, at least the one that R reports when using the predict function on a randomForest object with no newdata argument, is the average error rate of the classifier tested on all the observations that were kept out of the training sample at each stage of the bootstrap aggregation. Because these were independent observations, and not used for training, it closely approximates the CV error rate. The error rate reported on the remaining observations left out of the sample at every bagging step is called the Out-Of-Bag (OOB) error rate.
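The OOB idea itself is easy to demonstrate with base R. This is a conceptual sketch with a made-up stand-in "model" (the in-bag median as a classification threshold), not what randomForest does internally: on each bootstrap draw, the rows that were not sampled serve as a free test set for that round's model.

```r
set.seed(7)
n <- 100
x <- rnorm(n)
y <- as.numeric(x > 0)  # a trivially learnable pattern

oob_errors <- replicate(500, {
  in_bag <- sample(n, replace = TRUE)         # bootstrap sample (with replacement)
  out_of_bag <- setdiff(seq_len(n), in_bag)   # rows never drawn this round
  threshold <- median(x[in_bag])              # our stand-in "model"
  mean(as.numeric(x[out_of_bag] > threshold) != y[out_of_bag])
})
mean(oob_errors)    # low, because the pattern is easy to learn

# fraction of rows left out of each bootstrap sample (about 1/e, or 36.8%)
left_out <- replicate(500, length(setdiff(seq_len(n), sample(n, replace = TRUE))) / n)
mean(left_out)
```

Averaging the per-round error on these held-out rows is exactly the spirit of the OOB error rate: every observation gets tested by models that never saw it.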
The primary drawback to random forests is that they, to some extent, revoke the chief benefit of decision trees: their interpretability; it is far harder to visualize the behavior of a random forest than it is for any of the component trees. This puts the interpretability of random forests somewhere between logistic regression (which is marginally more interpretable) and k-NN (which is largely un-interpretable).

At long last, let's use a random forest on our dataset to classify observations as being positive or negative for diabetes!
> library(randomForest)
> forest <- randomForest(diabetes ~ ., data=training,
+                        importance=TRUE,
+                        ntree=2000,
+                        mtry=5)
> accuracy(predict(forest), training[,9])
[1] 0.7654723
> predictions <- predict(forest, newdata=test)
> accuracy(predictions, test[,9])
[1] 0.7727273
In this incantation, we set the number of trees (ntree) to an arbitrarily high number and set the number of predictors (mtry) to 5. Though it is not shown above, I used the OOB error rate to guide the choice of this hyper-parameter. Had we left it blank, it would have defaulted to the square root of the number of total predictors.

As you can see from the output of our accuracy function, the random forest is competitive with the performance of our highest performing (on this dataset, at least) classifier: logistic regression. On other datasets, with other characteristics, random forests sometimes blow the competition out of the water.
Choosing a classifier

These are just four of the most popular classifiers out there, but there are many more to choose from. Although some classification mechanisms perform better on some types of datasets than others, it can be hard to develop an intuition for exactly the ones they are suitable for. In order to help with this, we will be examining the efficacy of our four classifiers on four different two-dimensional made-up datasets—each with a vastly different optimal decision boundary. In doing so, we will learn more about the characteristics of each classifier and have a better sense of the kinds of data they might be better suited for.

The four datasets are depicted in Figure 9.11:

Figure 9.11: A plot depicting the class patterns of our four illustrative and contrived datasets
The vertical decision boundary

The first contrived dataset we will be looking at is the one on the top-left panel of Figure 9.11. This is a relatively simple classification problem, because, just by visual inspection, you can tell that the optimal decision boundary is a vertical line. Let's see how each of our classifiers fares on this dataset:

Figure 9.12: A plot of the decision boundaries of our four classifiers on our first contrived dataset

As you can see, all of our classifiers performed well on this simple dataset; all of the methods find an appropriate straight vertical line that is most representative of the class division. In general, logistic regression is great for linear decision boundaries. Decision trees also work well for straight decision boundaries, as long as the boundaries are orthogonal to the axes! Observe the next dataset.
The diagonal decision boundary

The second dataset sports an optimal decision boundary that is a diagonal line—one that is not orthogonal to the axes. Here, we start to see some cool behavior from certain classifiers.

Figure 9.13: A plot of the decision boundaries of our four classifiers on our second contrived dataset

Though all four classifiers were reasonably effective in this dataset's classification, we start to see each of the classifiers' personalities come out. First, the k-NN creates a boundary that closely approximates the optimal one. The logistic regression, amazingly, throws a perfect linear boundary at the exact right spot.

The decision tree's boundary is curious; it is made up of perpendicular zig-zags. Though the optimal decision boundary is linear in the input space, the decision tree can't capture its essence. This is because decision trees only split on a function of one variable at a time. Thus, datasets with complex interactions may not be the best ones to attack with a decision tree.

Finally, the random forest, being composed of sufficiently varied decision trees, is able to capture the spirit of the optimal boundary.
The crescent decision boundary

This third dataset, depicted in the bottom-left panel of Figure 9.11, exhibits a very non-linear classification pattern:

Figure 9.14: A plot of the decision boundaries of our four classifiers on our third contrived dataset

In the preceding figure, our top performers are k-NN—which is highly effective with non-linear boundaries—and random forest—which is similarly effective. The decision tree is a little too jagged to compete at the top level. But the real loser here is logistic regression. Because logistic regression returns linear decision boundaries, it is ineffective at classifying these data.

To be fair, with a little finesse, logistic regression can handle these boundaries, too, as we'll see in the last example. However, in highly non-linear situations, where the nature of the non-linear boundary is unknown—or unknowable—logistic regression is often outperformed by other classifiers that natively handle these situations with ease.
The circular decision boundary

The last dataset we will be looking at, like the previous one, contains a non-linear classification pattern.

Figure 9.15: A plot of the decision boundaries of our four classifiers on our fourth contrived dataset

Again, just like in the last case, the winners are k-NN and random forest, followed by the decision tree with its jagged edges. And, again, the logistic regression unproductively throws a linear boundary at a distinctively not-linear pattern. However, stating that logistic regression is unsuitable for problems of this type is both negligent and dead wrong.

With a slight change in the incantation of the logistic regression, the whole game is changed, and logistic regression becomes the clear winner:

> model <- glm(factor(dep.var) ~ ind.var1 +
+              I(ind.var1^2) + ind.var2 + I(ind.var2^2),
+              data=this, family=binomial(logit))
Figure 9.16: A second-order (quadratic) logistic regression decision boundary

In the preceding figure, instead of modeling the binary dependent variable (dep.var) as a linear combination of solely the two independent variables (ind.var1 and ind.var2), we model it as a function of those two variables and those two variables squared. The result is still a linear combination of the inputs (before the inverse link function), but now the inputs contain non-linear transformations of the original inputs. This general technique is called polynomial regression and can be used to create a wide variety of non-linear boundaries. In this example, just squaring the inputs (resulting in a quadratic polynomial) outputs a classification circle that exactly matches the optimal decision boundary, as you can see in Figure 9.16. Cubing the original inputs (creating a cubic polynomial) suffices to describe the boundary in the previous example.

In fact, a logistic regression containing polynomial terms of arbitrarily large order can fit any decision boundary—no matter how non-linear and complicated. Careful, though! Using high order polynomials is a great way to make sure you overfit your data.

My general advice is to only use polynomial regression for cases where you know a priori what polynomial form your boundaries take on—like an ellipse! If you must experiment, keep a close eye on your cross-validated error rate to make sure you are not fooling yourself into thinking that you are doing the right thing taking on more and more polynomial terms.
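This quadratic trick can be seen end-to-end on simulated data with a circular class boundary. The variable names and the 0.5 squared-radius below are made up for illustration; everything uses base R's glm:

```r
set.seed(1)
n <- 500
ind.var1 <- runif(n, -1, 1)
ind.var2 <- runif(n, -1, 1)
# class 1 inside a circle, class 0 outside
dep.var <- as.numeric(ind.var1^2 + ind.var2^2 < 0.5)

# plain logistic regression can only draw a straight line...
linear.model <- glm(dep.var ~ ind.var1 + ind.var2, family = binomial(logit))

# ...but adding the squared terms lets the boundary become a circle
quad.model <- glm(dep.var ~ ind.var1 + I(ind.var1^2) +
                            ind.var2 + I(ind.var2^2),
                  family = binomial(logit))

accuracy_of <- function(model) {
  mean((predict(model, type = "response") > 0.5) == dep.var)
}
accuracy_of(linear.model)  # barely better than guessing the majority class
accuracy_of(quad.model)    # near-perfect
```

(The quadratic fit may warn about fitted probabilities of 0 or 1—the classes here are perfectly separable by the circle—but its predictions are what matter.)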
Exercises

Practice the following exercises to get a firm grasp on the concepts learned so far:

Did you notice that I put CV in italics when I said, Using k=27 seems like a safe bet as measured by the minimization of CV error? Did you wonder why? I (quite deliberately) made a gaffe in choosing the k in the k-NN from Figure 9.4. My choice wasn't wrong, per se, but my choice of k may have been informed by data that should have been unavailable to me. How might I have committed a common but serious error in hyper-parameter tuning? How might I have done things differently?

Remember that we spent a long time talking about the assumptions of linear regression? In contrast, we spent virtually no time discussing the assumptions of logistic regression. Although logistic regression has less stringent assumptions than its cousin, it is not assumption-free. Think about what some assumptions of logistic regression might be. Confirm your suspicions by doing research on the web. My omission of the assumptions was not out of laziness, and (again) it was quite deliberate. As you progress in your career as a data analyst, you will often come across exciting new classification methods that you will, no doubt, want to put to use right away. A trait that will set you apart from your more impulsive colleagues is one that promotes careful examination and independent research into where these techniques could go wrong.

You may be surprised to learn that all of the classification techniques that we discussed in this chapter can be adapted for use in regression (predicting continuous variables)! The adaptation of logistic regression is obvious, but think about how you might adapt the others for use in this purpose. Do some research into it.

To what extent can the rapid dismantling of the New Deal policies after the death of Roosevelt be factored into the concurrent rise of neoliberal economic ideas and policies of post-war American intellectual thought?
Summary

At a high level, in this chapter you learned about four of the most popular classifiers out there: k-Nearest Neighbors, logistic regression, decision trees, and random forests. Not only did you learn the basics and mechanics of these four algorithms, but you saw how easy they were to perform in R. Along the way, you learned about confusion matrices, hyper-parameter tuning, and maybe even a few new R incantations.

We also visited some more general ideas; for example, you've expanded your understanding of the bias-variance tradeoff, saw how the GLM can perform great feats, and became acquainted with ensemble learning and bootstrap aggregation. It's also my hope that you've developed some intuition as to which classifiers to use in different situations. Finally, given that we couldn't achieve perfect classification on our diabetes dataset, I hope that you've gained an appreciation for the art and difficulty of classification. Perhaps you've even caught the statistical learning bug and want to try to beat our performance in this chapter! That would be great! There are competitions on the web for people just like you—and it is a great way to hone your skills. This, for better or worse, concludes our unit on predictive analytics. In the final unit, we will be discussing some of the trials and tribulations of data analysis as it tends to go in practice. Stay tuned!
Chapter 10. Sources of Data

The previous two units (Confirmatory Data Analysis and Inferential Statistics and Predictive Analytics) have focused on teaching both theory and practice in ideal data scenarios, so that our more academic quests can be divorced from outside concerns about the veracity or format of the data. To this end, we deliberately stayed away from datasets not already built into R or available from add-on packages. But very few people I know get by in their careers using R without importing any data from sources outside R packages.

Well, we very briefly touched upon how to load data into R (the read.* commands) in the very first chapter of this book, did we not? So we should be all set, right?

Here's the rub: I know almost as few people that can get by using simple CSVs and tab-delimited text locally with the primary read.* commands as can get by not using outside sources of data at all! The unfortunate fact is that many introductory analytics texts largely disregard this reality. This produces many well-informed new analysts who are nevertheless stymied on their first attempt to apply their fresh knowledge to "real-world data". In my opinion, any text that purports to be a practical resource for data analysts cannot afford to ignore this.
Luckily, due to largely undirected and unplanned personal research I do for blog posts and my own edification using motley collections of publicly available data sources in various formats, I—perhaps delusionally—consider myself fairly adept in navigating this criminally overlooked portion of practical analytics. It is the body of lessons I've learned during these wild data adventures that I'd like to impart to you in this and the subsequent chapter, dear reader.

It's common for data sources to be not only difficult to load, for various reasons, but to be difficult to work with because of errors, junk, or just general idiosyncrasies. Because of this, this chapter and the next chapter, Dealing with messy data, will have a lot in common. This chapter will concentrate more on getting data from outside sources and getting it into a somewhat usable form in R. The next chapter will discuss particularly common gotchas while working with data in an imperfect world.

I appreciate that not everyone has the interest or the time-availability to go on wild goose hunts for publicly available data to answer questions formed on a whim. Nevertheless, the techniques that we'll be discussing in this chapter should be very helpful in handling the variety of data formats that you'll have to contend with in the course of your work or research. Additionally, having the wherewithal to employ freely available data on the web can be indispensable for learning new analytics methods and technologies.
The first source of data we'll be looking at is that of the venerable relational database.

Relational databases

Perhaps the most common external source of data is relational databases. Since this section is probably of interest to only those who work with databases—or at least plan to—some knowledge of the basics of relational databases is assumed.

One way to connect to databases from R is to use the RODBC package. This allows one to access any database that implements the ODBC common interface (for example, PostgreSQL, Access, Oracle, SQLite, DB2, and so on). A more common method—for whatever reason—is to use the DBI package and DBI-compliant drivers.

DBI is an R package that defines a generalized interface for communication between different databases and R. Like with ODBC, it allows the same compliant SQL to run on multiple databases. The DBI package alone is not sufficient for communicating with any particular database from R; in order to use DBI, you must also install and load a DBI-compliant driver for your particular database. Packages exist providing drivers for many RDBMSs. Among them are RPostgreSQL, RSQLite, RMySQL, and ROracle.

In order to most easily demonstrate R/DB communication, we will be using a SQLite database. This will also most easily allow the prudent reader to create the example database and follow along. The SQL we'll be using is standard, so you can really use any DB you want, anyhow.
Our example database has two tables: artists and paintings. The artists table contains a unique integer ID, an artist's name, and the year they were born. The paintings table contains a unique integer ID, an artist ID, the name of the painting, and its completion date. The artist ID in the paintings table is a foreign key that references the artist ID in the artists table; this is how this database links paintings to their respective painters.

If you want to follow along, use the following SQL statements to create and populate the database. If you're using SQLite, name the database art.db.
CREATE TABLE artists(
  artist_id INTEGER PRIMARY KEY,
  name TEXT,
  born_on INTEGER
);

CREATE TABLE paintings(
  painting_id INTEGER PRIMARY KEY,
  painting_artist INTEGER,
  painting_name TEXT,
  year_completed INTEGER,
  FOREIGN KEY(painting_artist) REFERENCES artists(artist_id)
);

INSERT INTO artists(name, born_on)
  VALUES ("Kay Sage", 1898),
         ("Piet Mondrian", 1872),
         ("Rene Magritte", 1898),
         ("Man Ray", 1890),
         ("Jean-Michel Basquiat", 1960);

INSERT INTO paintings(painting_artist, painting_name, year_completed)
  VALUES (4, "Orquesta Sinfonica", 1916),
         (4, "La Fortune", 1938),
         (1, "Tommorow is Never", 1955),
         (1, "The Answer is No", 1958),
         (1, "No Passing", 1954),
         (5, "Bird on Money", 1981),
         (2, "Place de la Concorde", 1943),
         (2, "Composition No. 10", 1942),
         (3, "The Human Condition", 1935),
         (3, "The Treachery of Images", 1948),
         (3, "The Son of Man", 1964);
Confirm for yourself that the following SQL commands yield the appropriate results by typing them into the sqlite3 command line interface.

SELECT * FROM artists;
--------------------------------
1|Kay Sage|1898
2|Piet Mondrian|1872
3|Rene Magritte|1898
4|Man Ray|1890
5|Jean-Michel Basquiat|1960

SELECT * FROM paintings;
--------------------------------------
1|4|Orquesta Sinfonica|1916
2|4|La Fortune|1938
3|1|Tommorow is Never|1955
4|1|The Answer is No|1958
5|1|No Passing|1954
6|5|Bird on Money|1981
7|2|Place de la Concorde|1943
8|2|Composition No. 10|1942
9|3|The Human Condition|1935
10|3|The Treachery of Images|1948
11|3|The Son of Man|1964
For our first act, we load the necessary packages, choose our database driver, and connect to the database:

library(DBI)
library(RSQLite)

sqlite <- dbDriver("SQLite")

# we read the art sqlite db from the current
# working directory which can be get and set
# with getwd() and setwd(), respectively
art_db <- dbConnect(sqlite, "./art.db")
Again, we are using SQLite for this example, but this procedure is applicable to all DBI-compliant database drivers.

Let's now run a query against this database. Let's get a list of all the painting names and their respective artist's name. This will require a join operation between the two tables:

result <- dbSendQuery(art_db,
                      "SELECT paintings.painting_name, artists.name
                       FROM paintings INNER JOIN artists
                       ON paintings.painting_artist = artists.artist_id;")
response <- fetch(result)
head(response)
dbClearResult(result)
----------------------------------------------
       painting_name     name
1 Orquesta Sinfonica  Man Ray
2         La Fortune  Man Ray
3  Tommorow is Never Kay Sage
4   The Answer is No Kay Sage
5         No Passing Kay Sage

Here we used the dbSendQuery function to send a query to the database. Its first and second arguments were the database handle variable (from the dbConnect function) and the SQL statement, respectively. We store a handle to the result in a variable. Next, the fetch function retrieves the response from the handle. By default, it will retrieve all matches from the query, though this can be limited by specifying the n argument (see help("fetch")). The result of the fetch is then stored in the variable response. response is an R data frame like any other; we can do any of the operations we've already learned with it. Finally, we clear the result, which is good practice, because it frees resources.
For a slightly more involved query, let's try to find the average (mean) age the artists were when each of the paintings was completed. This still requires a join, but this time we are selecting paintings.year_completed and artists.born_on.

result <- dbSendQuery(art_db,
                      "SELECT paintings.year_completed, artists.born_on
                       FROM paintings INNER JOIN artists
                       ON paintings.painting_artist = artists.artist_id;")
response <- fetch(result)
head(response)
dbClearResult(result)
----------------------------
  year_completed born_on
1           1916    1890
2           1938    1890
3           1955    1898
4           1958    1898
5           1954    1898
6           1981    1960

At this time, row-wise subtraction and averaging can be performed simply:

mean(response$year_completed - response$born_on)
----------
[1] 51.091

Finally, we close our connection to the database:

dbDisconnect(art_db)
Whydidn’twejustdothatinSQL?
Why,indeed.Althoughthisverysimpleexamplecouldhaveeasilyjustbeenwritteninto
thelogicoftheSQLquery,formorecomplicateddataanalysisthissimplywon’tcutit.
Unlessyouareusingareallyspecializeddatabase,manydatabasesaren’tpreparedfor
certainmathematicalfunctionswithregardtonumericalaccuracy.Moreimportantly,most
databasesdon’timplementadvancedmathfunctionsatall.Eveniftheydid,theyalmost
certainlywouldn’tbeportablebetweendifferentRDBMSs.Thereisgreatmeritinhaving
analyticslogicresideinRsothatif—forwhateverreason—youhavetoswitchdatabases,
youranalysiscodewillremainunchanged.
Note
IfSQLisyourcupoftea,didyouknowyoucanusethesqldfpackagetoperform
arbitrarySQLqueriesondata.frames?
There is a rising interest and (to a lesser extent) need in databases that don't adhere to the relational paradigm. These so-called NoSQL databases include the immensely popular Hadoop/HDFS, MongoDB, CouchDB, Neo4j, and Redis, among many others. There are R packages for communicating with most of these, too, including one for every one of the databases mentioned here by name. Since the operation of all of these packages is idiosyncratic and heavily dependent on which species of NoSQL the database in question belongs to, your best bet for learning how to use them is to read the help pages and/or vignettes for each package.
Using JSON

JavaScript Object Notation (JSON) is a standardized human-readable data format that plays an enormous role in communication between web browsers and web servers. JSON was originally borne out of a need to represent arbitrarily complex data structures in JavaScript—a web scripting language—but it has since grown into a language-agnostic data serialization format.

It is a common need to import and parse JSON in R, particularly when working with web data. For example, it is very common for websites to offer web services that take an arbitrary query from a web browser, and return the response as JSON. We will see an example of this very use case later in this section.
For our first look into JSON parsing for R, we'll use the jsonlite package to read a small JSON string, which serializes some information about the best musical act in history, The Beatles:

library(jsonlite)

example.json <- '
{
  "thebeatles": {
    "formed": 1960,
    "members": [
      {
        "firstname": "George",
        "lastname": "Harrison"
      },
      {
        "firstname": "Ringo",
        "lastname": "Starr"
      },
      {
        "firstname": "Paul",
        "lastname": "McCartney"
      },
      {
        "firstname": "John",
        "lastname": "Lennon"
      }
    ]
  }
}'

the_beatles <- fromJSON(example.json)
print(the_beatles)
---------------------
$thebeatles
$thebeatles$formed
[1] 1960

$thebeatles$members
  firstname  lastname
1    George  Harrison
2     Ringo     Starr
3      Paul McCartney
4      John    Lennon
We used the fromJSON function to read in the string. The result is an R list, whose elements/attributes can be accessed via the $ operator, or the [[ double square bracket function/operator. For example, we can access the date when The Beatles formed, in R, in the following two ways:

the_beatles$thebeatles$formed
the_beatles[["thebeatles"]][["formed"]]
---------
[1] 1960
[1] 1960

Note

In R, a list is a data structure that is kind of like a vector, but allows elements of differing data types. A single list may contain numerics, strings, vectors, or even other lists!

Now that we have the very basics of handling JSON down, let's move on to using it in a non-trivial manner!
There’samusic/social-media-platformcalledhttp://www.last.fmthat/thatkindlyprovides
awebserviceAPIthat’sfreeforpublicuse(aslongasyouabidebytheirreasonable
terms).ThisAPI(ApplicationProgrammingInterface)allowsustoqueryvarious
pointsofdataaboutmusicalartistsbycraftingspecialURLs.Theresultsoffollowing
theseURLsareeitheraJSONorXMLpayload,whicharedirectlyconsumablefromR.
Inthisnon-trivialexampleofusingwebdata,wewillbebuildingarudimentary
recommendationsystem.Oursystemwillallowustosuggestnewmusictoaparticular
personbasedonanartistthattheyalreadylike.Inordertodothis,wehavetoquerythe
Last.fmAPItogatherallthetagsassociatedwithparticularartists.Thesetagsfunctiona
lotlikegenreclassifications.Thesuccessofourrecommendationsystemwillbe
predicatedontheassumptionthatmusicalartistswithoverlappingtagsaremoresimilarto
eachotherthanartistswithdisparatetags,andthatsomeoneismorelikelytoenjoysimilar
artiststhananarbitrarydissimilarartist.
Here’sanexampleJSONexcerptoftheresultofqueryingtheAPIfortagsofaparticular
artist:
{
"toptags":{
"tag":[
{
"count":100,
"name":"femalevocalists",
"url":"http://www.last.fm/tag/female+vocalists"
},
{
"count":71,
"name":"singer-songwriter",
"url":"http://www.last.fm/tag/singer-songwriter"
},
{
"count":65,
"name":"pop",
"url":"http://www.last.fm/tag/pop"
}
]
}
}
Here,weonlycareaboutthenameofthetag—nottheURL,orthecountofoccasions
Last.fmusersappliedeachtagtotheartist.
Let’sfirstcreateafunctionthatwillconstructtheproperlyformattedqueryURLfora
particularartist.TheLast.fmdeveloperwebsiteindicatesthattheformatis:
http://ws.audioscrobbler.com/2.0/?method=artist.gettoptags&artist=
<THE_ARTIST>&api_key=c2e57923a25c03f3d8b317b3c8622b43&format=json
In order to create these URLs based upon arbitrary input, we can use the paste0 function to concatenate the component strings. However, URLs can't handle certain characters such as spaces; in order to convert the artist's name to a format suitable for a URL, we'll use the URLencode function from the (preloaded) utils package.

URLencode("The Beatles")
------
[1] "The%20Beatles"

Now we have all the pieces to put this function together:

create_artist_query_url_lfm <- function(artist_name){
  prefix <- "http://ws.audioscrobbler.com/2.0/?method=artist.gettoptags&artist="
  postfix <- "&api_key=c2e57923a25c03f3d8b317b3c8622b43&format=json"
  encoded_artist <- URLencode(artist_name)
  return(paste0(prefix, encoded_artist, postfix))
}

create_artist_query_url_lfm("Depeche Mode")
-------------------
[1] "http://ws.audioscrobbler.com/2.0/?method=artist.gettoptags&artist=Depeche%20Mode&api_key=c2e57923a25c03f3d8b317b3c8622b43&format=json"
Fantastic! Now we make the web request, and parse the resulting JSON. Luckily, the fromJSON function that we've been using can take a URL and automatically make the web request for us. Let's see what it looks like:

fromJSON(create_artist_query_url_lfm("Depeche Mode"))
-----------------------------------------
$toptags
$toptags$tag
  count       name                               url
1   100 electronic http://www.last.fm/tag/electronic
2    87   new wave   http://www.last.fm/tag/new+wave
3    59        80s        http://www.last.fm/tag/80s
4    56  synth pop  http://www.last.fm/tag/synth+pop
........

Neat-o! If you take a close look at the structure, you'll see that the tag names are stored in the name attribute of the tag attribute of the toptags attribute (whew!). This means we can extract just the tag names with $toptags$tag$name. Let's write a function that will take an artist's name, and return the tags in a vector.
get_tag_vector_lfm <- function(an_artist){
  artist_url <- create_artist_query_url_lfm(an_artist)
  json <- fromJSON(artist_url)
  return(json$toptags$tag$name)
}

get_tag_vector_lfm("Depeche Mode")
------------------------------------------
[1] "electronic"  "new wave"    "80s"
[4] "synthpop"    "synth pop"   "seen live"
[7] "alternative" "rock"        "british"
........

Next, we have to go ahead and retrieve the tags for all artists. Instead of doing this (and probably violating Last.fm's terms of service), we'll just pretend that there are only six musical artists in the world. We'll store all of these artists in a list. This will make it easy to use the lapply function to apply the get_tag_vector_lfm function to each artist in the list. Finally, we'll name all the elements in the list appropriately:
our_artists <- list("Kate Bush", "Peter Tosh", "Radiohead",
                    "The Smiths", "The Cure", "Black Uhuru")
our_artists_tags <- lapply(our_artists, get_tag_vector_lfm)
names(our_artists_tags) <- our_artists

print(our_artists_tags)
--------------------------------------
$`Kate Bush`
[1] "female vocalists"  "singer-songwriter" "pop"
[4] "alternative"       "80s"               "british"
........

$`Peter Tosh`
[1] "reggae"       "roots reggae" "Rasta"
[4] "roots"        "ska"          "jamaican"
........

$Radiohead
[1] "alternative"      "alternative rock"
[3] "rock"             "indie"
........

$`The Smiths`
[1] "indie"       "80s"         "post-punk"
[4] "new wave"    "alternative" "rock"
........

$`The Cure`
[1] "post-punk"   "new wave"    "alternative"
[4] "80s"         "rock"        "seen live"
........

$`Black Uhuru`
[1] "reggae"       "roots reggae" "dub"
[4] "jamaica"      "roots"        "jamaican"
........
Now that we have all the artists' tags stored as a list of vectors, we need some way of comparing the tag lists and judging them for similarity.

The first idea that may come to mind is to count the number of tags each pair of artists have in common. Though this may seem like a good idea at first glance, consider the following scenario:

Artist A and artist B have hundreds of tags each, and they share three tags in common; artists C and D each have two tags, both of which are mutually shared. Our naive metric for similarity suggests that artists A and B are more similar than C and D (by 50%). If your intuition tells you that C and D are more similar, though, we are both in agreement.
To make our similarity measure comport more with our intuition, we will instead use the Jaccard index. The Jaccard index (also Jaccard coefficient) between sets A and B, J(A, B), is given by:

J(A, B) = |A ∩ B| / |A ∪ B|

where A ∩ B is the set intersection (the common tags), A ∪ B is the set union (an unduplicated list of all the tags in both sets), and |X| is the set X's cardinality (the number of elements in that set).

This metric has the attractive property that it is naturally constrained: 0 ≤ J(A, B) ≤ 1.
Let’swriteafunctionthattakestwosets,andreturnstheJaccardindex.We’llemploythe
built-infunctionsintersectandunion.
jaccard_index<-function(one,two){
length(intersect(one,two))/length(union(one,two))
}
Let’stryitonTheCureandRadiohead:
jaccard_index(our_artists_tags[["Radiohead"]],
our_artists_tags[["TheCure"]])
--------------[1]0.3333
Neat!Manualcheckingconfirmsthatthisistherightanswer.
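The Jaccard index also formalizes the earlier intuition about artists A through D. Using made-up tag sets with the same structure—A and B sharing three of their hundred tags each, C and D sharing both of their two:

```r
jaccard_index <- function(one, two){
  length(intersect(one, two)) / length(union(one, two))
}

a_tags <- c(paste0("shared", 1:3), paste0("a_only", 1:97))  # 100 tags
b_tags <- c(paste0("shared", 1:3), paste0("b_only", 1:97))  # 100 tags
c_tags <- c("reggae", "dub")
d_tags <- c("reggae", "dub")

jaccard_index(a_tags, b_tags)  # 3/197, about 0.015
jaccard_index(c_tags, d_tags)  # 1
```

A and B score near zero despite sharing "more" tags in absolute terms, while C and D score a perfect 1—exactly what intuition demands.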
The next step is to construct a similarity matrix. This is an n x n matrix (where n is the number of artists) that depicts all the pairwise similarity measurements. If this explanation is confusing, look at the code output before reading the following code snippet:
similarity_matrix <- function(artist_list, similarity_fn){
  num <- length(artist_list)
  # initialize a num by num matrix of zeroes
  sim_matrix <- matrix(0, ncol=num, nrow=num)
  # name the rows and columns for easy lookup
  rownames(sim_matrix) <- names(artist_list)
  colnames(sim_matrix) <- names(artist_list)
  # for each row in the matrix
  for(i in 1:nrow(sim_matrix)){
    # and each column
    for(j in 1:ncol(sim_matrix)){
      # calculate that pair's similarity
      the_index <- similarity_fn(artist_list[[i]],
                                 artist_list[[j]])
      # and store it in the right place in the matrix
      sim_matrix[i, j] <- round(the_index, 2)
    }
  }
  return(sim_matrix)
}

sim_matrix <- similarity_matrix(our_artists_tags, jaccard_index)
print(sim_matrix)
--------------------------------------------------------------
            Kate Bush Peter Tosh Radiohead The Smiths The Cure Black Uhuru
Kate Bush        1.00       0.05      0.31       0.25     0.21        0.04
Peter Tosh       0.05       1.00      0.02       0.03     0.03        0.33
Radiohead        0.31       0.02      1.00       0.31     0.33        0.04
The Smiths       0.25       0.03      0.31       1.00     0.44        0.05
The Cure         0.21       0.03      0.33       0.44     1.00        0.05
Black Uhuru      0.04       0.33      0.04       0.05     0.05        1.00
Ifyou’refamiliarwithsomeofthesebands,you’llnodoubtseethatthesimilaritymatrix
intheprecedingoutputmakesalotofprimafaciesense—itlookslikeourtheoryissound!
Ifyounotice,thevaluesalongthediagonal(fromtheupper-leftpointtothelower-right)
areall1.ThisisbecausetheJaccardindexoftwoidenticalsetsisalways1—andartists’
similaritywiththemselvesisalways1.Additionally,allthevaluesaresymmetricwith
respecttothediagonal;whetheryoulookupPeterToshandRadioheadbycolumnand
thenrow,orviceversa,thevaluewillbethesame(.02).Thispropertymeansthatthe
matrixissymmetric.Thisisapropertyofallsimilaritymatricesusingsymmetric
(commutative)similarityfunctions.
Note

A similar (and perhaps more common) concept is that of a distance matrix (or dissimilarity matrix). The idea is the same, but now the values that are higher will refer to more musically distant pairs of artists. Also, the diagonal will be zeroes, since an artist is less musically different from themselves than from any other artist. If all the values of a similarity matrix are between 0 and 1 (as is often the case), you can easily make it into a distance matrix by subtracting every element from 1. Subtracting from 1 again will yield the original similarity matrix.
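One thing a distance matrix buys you is access to R's built-in clustering tools. As a sketch—with a small hand-made similarity matrix standing in for the artist one—base R's as.dist and hclust will group the most similar items together:

```r
# a tiny, made-up 3-item similarity matrix (values between 0 and 1)
sim <- matrix(c(1.0, 0.4, 0.1,
                0.4, 1.0, 0.2,
                0.1, 0.2, 1.0),
              nrow = 3,
              dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

d <- as.dist(1 - sim)      # similarity -> distance; the zero diagonal drops out
clustering <- hclust(d)    # hierarchical clustering on the distances
cutree(clustering, k = 2)  # A and B (the most similar pair) share a cluster
```

The same two lines applied to 1 - sim_matrix from the artist example would cluster the reggae artists apart from the rock/pop ones.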
Recommendations can now be furnished, for listeners of one of the bands, by sorting that artist's column in the matrix in a descending order; for example, if a user likes The Smiths, but is unsure what other bands she should try listening to:

# The Smiths are the fourth column
sim_matrix[order(sim_matrix[,4], decreasing=TRUE), 4]
----------------------------------------------
 The Smiths    The Cure   Radiohead   Kate Bush Black Uhuru
       1.00        0.44        0.31        0.25        0.05
 Peter Tosh
       0.03

Of course, a recommendation of The Smiths for this user is nonsensical. Going down the list, it looks like a recommendation of The Cure is the safest bet, though Radiohead and Kate Bush may also be fine recommendations. Black Uhuru and Peter Tosh are unsafe bets if all we know about the user is a fondness for The Smiths.
XML

XML, like JSON, is an absolutely ubiquitous format for data transfer over the Internet. In addition to being used on the web, XML is also a popular data format for application configuration files and the like. In fact, newer Microsoft Office documents (with the extension .docx or .xlsx) are stored as XML files.
Here’swhatoursimpleBeatlesdatasetmaylooklikeinXML:
example_xml1<-'
<the_beatles>
<formed>1960</formed>
<members>
<member>
<first_name>George</first_name>
<last_name>Harrison</last_name>
</member>
<member>
<first_name>Ringo</first_name>
<last_name>Starr</last_name>
</member>
<member>
<first_name>Paul</first_name>
<last_name>McCartney</last_name>
</member>
<member>
<first_name>John</first_name>
<last_name>Lennon</last_name>
</member>
</members>
</the_beatles>'
Much like JSON, XML is stored in a tree structure—this is called a DOM (Document Object Model) tree in XML parlance. Each piece of information in an XML document—surrounded by names in angle brackets—is called an element or node. In the hierarchical structure, subnodes are called children. In the preceding code, formed is a child of the_beatles, and member is a child of members. Each node may have zero or more children, who may have children nodes of their own. For example, the members node has four children, each of whom have two children, first_name and last_name. The common parent of all the elements (whether direct parent or great-great-grandparent) is the root node, which doesn't have a parent.
Note
AswithJSON,XMLandXMLimportfunctionsisanenormoustopic.We’llonlybriefly
coversomeofthemorecommonandbasicknow-howinthischapter.Fortunately,Rhasa
built-inhelpanddocumentation.Forthispackage,help(package="XML")indicatesthat
moredocumentationisavailableatthepackage’sURL:http://www.omegahat.org/RSXML
WewillreadtheprecedingXMLwiththeXMLpackage.Ifyoudon’thaveitalready,make
sureyouinstallit.
library(XML)

the_beatles <- xmlTreeParse(example_xml1)
print(names(the_beatles))
------------------
[1] "doc" "dtd"

print(the_beatles$doc)
--------------------
$file
[1] "<buffer>"

$version
[1] "1.0"

$children
$children$the_beatles
<the_beatles>
 <formed>1960</formed>
 <members>
  <member>
   <first_name>George</first_name>
   <last_name>Harrison</last_name>
  </member>
  ..........
 </members>
</the_beatles>

attr(,"class")
[1] "XMLDocumentContent"
xmlTreeParse reads and parses the DOM, and stores it as an R list. The actual content is stored in the children attribute of the doc attribute. We can access the year The Beatles were formed like so:

print(xmlValue(the_beatles$doc$children$the_beatles[["formed"]]))
---------------------
[1] "1960"

Here, we use the xmlValue function to extract the value stored in the formed node.

If we wanted to get the first names of all the members, we have to store the root node of the DOM, and iterate over the children of the members node. In particular, we use the sapply function (which applies a function to each element of a vector) over the children with a function that returns the xmlValue of the first_name node. Concretely:
root <- xmlRoot(the_beatles)
sapply(xmlChildren(root[["members"]]), function(x){
  xmlValue(x[["first_name"]])
})
------------------------------------------
  member   member   member   member
"George"  "Ringo"   "Paul"   "John"
Thoughit’spossibletoworkwiththeDOMinthismanner,itismuchmorecommonto
interrogateXMLusingXPath.
XPathiskindoflikeanXMLquerylanguage—likeSQL,butforXML.Itallowsusto
selectnodesthatmatchaparticularpatternorlocation.Formatching,itusespath
expressionsthatidentifynodesbasedontheirname,location,orrelationshipswithother
nodes.
Thispowerfultoolalsocomeswithaproportionallysteeplearningcurve.Luckily,itis
somewhateasytogetstarted.Inaddition,therearealotofgreattutorialsonline.The
excellenttutorialthattaughtmeXPathisavailableat
http://www.w3schools.com/xsl/xpath_intro.asp.
TouseXPath,wehavetore-importtheXMLusingthexmlParse(notXMLTreeParse)
function,whichusesadifferentoptimizedinternalrepresentation.Toreplicatetheresults
ofthepreviouscodesnippetusingXPath,wearegoingtousethefollowingXPath
statement:
all_first_names<-"//member/first_name"
Theprecedingstatementroughlytranslatesto“forallmembernodesanywhereoccurring
anywhereinthedocument,getthechildnodenamedfirst_name“.
the_beatles <- xmlParse(example_xml1)
getNodeSet(the_beatles, all_first_names)
-------
[[1]]
<first_name>George</first_name>

[[2]]
<first_name>Ringo</first_name>

[[3]]
<first_name>Paul</first_name>

[[4]]
<first_name>John</first_name>

attr(,"class")
[1] "XMLNodeSet"

Equivalent XPath expressions could also be written thus:

getNodeSet(the_beatles, "//first_name")
getNodeSet(the_beatles, "/the_beatles/members/member/first_name")

And just the XML values for each node can be extracted thus:

sapply(getNodeSet(the_beatles, all_first_names), xmlValue)
------------------------------
[1] "George" "Ringo"  "Paul"   "John"
There is more than one way to represent the same information in XML. The following XML is another way of representing the same data about The Beatles. This uses XML attributes instead of nodes for formed, first_name, and last_name:

example_xml2 <- '
<the_beatles formed="1960">
  <members>
    <member first_name="George" last_name="Harrison" />
    <member first_name="Richard" last_name="Starkey" />
    <member first_name="Paul" last_name="McCartney" />
    <member first_name="John" last_name="Lennon" />
  </members>
</the_beatles>'

In this case, retrieving a vector of all first names can be done using this snippet:

the_beatles <- xmlParse(example_xml2)
sapply(getNodeSet(the_beatles, "//member[@first_name]"),
       function(x){ xmlAttrs(x)[["first_name"]] })
----------
[1] "George"  "Richard" "Paul"    "John"
It may help your understanding of XML processing in R to use it in a real-life example.

There is a repository of music information called MusicBrainz (http://musicbrainz.org). Like Last.fm, this website kindly allows custom queries against their info database, and returns the results in XML format.

We will use this service to extend the recommendation system that we created using just tags from Last.fm by combining them with tags from MusicBrainz.

To query the database for a particular artist, the format is as follows:

http://musicbrainz.org/ws/2/artist/?query=artist:<THE_ARTIST>

For example, the query for Kate Bush is: http://musicbrainz.org/ws/2/artist/?query=artist:Kate%20Bush

If you visit that link, you'll see that it returns an XML document that contains a list of artists that match the search to varying degrees. The list contains, among others, John Bush, Shelly Bush, and Bush. Luckily, the matches are in order of descending matchiness and, for all the artists that we'll be working with, the correct artist is the first artist in the artist-list node.

In case you can't view the link yourself, the following is essentially the structure of it:
<metadata xmlns="http://musicbrainz.org/ns/mmd-2.0#">
  <artist-list>
    <artist>
      <name>Kate Bush</name>
      <tag-list>
        <tag count="1">
          <name>kent</name>
        </tag>
        <tag count="1">
          <name>english</name>
        </tag>
        <tag count="3">
          <name>british</name>
        </tag>
      </tag-list>
    </artist>
  </artist-list>
</metadata>
This means that the XPath expression that selects all the tags (of the first artist) is given by: //artist[1]/tag-list/tag/name

As with JSON/Last.fm, let's write the function that, for any given artist, returns the appropriate query URL:

create_artist_query_url_mb <- function(artist){
  encoded_artist <- URLencode(artist)
  return(paste0("http://musicbrainz.org/ws/2/artist/?query=artist:",
                encoded_artist))
}

create_artist_query_url_mb("Depeche Mode")
------
[1] "http://musicbrainz.org/ws/2/artist/?query=artist:Depeche%20Mode"
Now,let’swritethefunctionthatreturnsthelistoftagsforaparticularartist.
Becausenothingisevereasy,theXPathmentionedintheprecedingcodewillnotworkas
is.ThisisbecausetheMusicBrainzXMLusesanXMLnamespace.Thoughitmakesour
job(marginally)harder,anXMLnamespaceisgenerallyagoodthing,becauseit
eliminatesambiguitywhenreferringtoelementnamesbetweendifferentXMLdocuments
whoseelementnamesarearbitrarilydefinedbythedeveloper.
Astheresponsesuggests,thenamespaceisgivenbyhttp://musicbrainz.org/ns/mmd2.0#.InordertousethisinourtagextractionfunctionandXPathselecting,weneedto
storeandnamethisnamespacefirst:
ns<-"http://musicbrainz.org/ns/mmd-2.0#"
names(ns)[1]<-"ns"
NowwehaveallweneedtowritetheMusicBrainzcounterparttothe
get_tag_vector_lfmfunction.
get_tag_vector_mb<-function(an_artist,ns){
artist_url<-create_artist_query_url_mb(an_artist)
the_xml<-xmlParse(artist_url)
xpath<-"//ns:artist[1]/ns:tag-list/ns:tag/ns:name"
the_nodes<-getNodeSet(the_xml,xpath,ns)
return(unlist(lapply(the_nodes,xmlValue)))
}
get_tag_vector_mb("DepecheMode",ns)
------------------------------------[1]"electronica""postpunk""alternativedance"
[4]"electronic""darkwave""britannique"
............
LikefromJSON,xmlParsehandlesURLsnatively.
Let’sfinishthisup:
our_artists<-list("KateBush","PeterTosh","Radiohead",
"TheSmiths","TheCure","BlackUhuru")
our_artists_tags_mb<-lapply(our_artists,get_tag_vector_mb,ns)
names(our_artists_tags_mb)<-our_artists
sim_matrix<-similarity_matrix(our_artists_tags_mb,jaccard_index)
print(sim_matrix)
------KateBushPeterToshRadioheadTheSmithsTheCureBlackUhuru
KateBush1.000.000.240.270.240.00
PeterTosh0.001.000.000.000.000.17
Radiohead0.240.001.000.230.230.00
TheSmiths0.270.000.231.000.380.00
TheCure0.240.000.230.381.000.00
BlackUhuru0.000.170.000.000.001.00
>sim_matrix[order(sim_matrix[,4],decreasing=TRUE),4]
------------------------------TheSmithsTheCureKateBushRadioheadPeterToshBlackUhuru
1.000.380.270.230.000.00
Thisyieldsresultsthatarequitesimilartotherecommendationsystemthatusestagsfrom
onlyLast.fm.Personally,Iliketheformerbetter,buthowaboutwecombineboth?Wecan
dothiseasilybytakingthesetintersectionofartists’tagsbetweenthetwoservices.
for(iin1:length(our_artists_tags)){
the_artist<-names(our_artists_tags)[i]
#the_artistnowholdsthecurrentartist'sname
combined_tags<-union(our_artists_tags[[the_artist]],
our_artists_tags_mb[[the_artist]])
our_artists_tags[[the_artist]]<-combined_tags
}
sim_matrix<-similarity_matrix(our_artists_tags,jaccard_index)
print(sim_matrix)
-------KateBushPeterToshRadioheadTheSmithsTheCureBlackUhuru
KateBush1.000.040.290.240.190.03
PeterTosh0.041.000.010.030.030.29
Radiohead0.290.011.000.290.300.03
TheSmiths0.240.030.291.000.400.05
TheCure0.190.030.300.401.000.05
BlackUhuru0.030.290.030.050.051.00
Super!
Other data formats

One of the things that makes R great is the wealth of high-quality add-on packages. As you might expect, there are many of these add-on packages with the ability to import data in a multitude of other formats. Whether it's an arcane markup language, a proprietary binary file, an Excel spreadsheet, and so on, there is almost certainly an R package out there for you to handle it. But how to find them?

One way is to browse the community-maintained CRAN Task Views (https://cran.r-project.org/web/views/). A task view is a way to browse for packages related to a particular topic, domain, or special interest. The germane Task View, here, is the Web Technologies Task View (https://cran.r-project.org/web/views/WebTechnologies.html). You'll notice that jsonlite and the XML package are mentioned on the first page.

The easiest way to discover these packages, though, is through your favorite web browser. For example, if you are looking for a package to import YAML data (yet another data serialization format), you might search R CRAN package yaml. If you use a search engine that tracks you (don't fight the singularity), eventually a search of only R yaml will suffice to get you where you need to go.

Developing fast and reliable information retrieval skills (like search-engine-fu) is probably one of the most valuable assets of a statistical programmer—or any programmer, for that matter. Cultivating these skills will serve you well, dear reader.
Online repositories

Look back to the Web Technologies task view we talked about in the previous section. There is a tremendous number of R packages specifically designed to import data directly from specialized sources on the web. Among these are packages to search for and retrieve the full text of academic articles in the Public Library of Science journals (rplos), search for and download the full text of Wikipedia articles (WikipediR), download data about Berlin from the German government (BerlinData), interface with the Chromosome Counts Database (chromer), download historical financial data (quantmod), and access the information in the PubChem chemistry database (rpubchem).

These examples notwithstanding, given that there are many hundreds of immense repositories of public data, it is far too much to expect the R community to have a package specially built for every single one. Luckily, with the ability to handle many different data formats under our belt, we can just download and import the data from these repositories ourselves. The following are a few of my favorite repositories. Perhaps some of them will have dedicated R packages for handling them by the time you read this.

data.gov: a huge repository of data from the US government in a variety of formats including CSV, XML, and JSON
data.gov.uk: the UK's equivalent repository
data.worldbank.org: a spot for data made available by the World Bank including data on climate change, poverty, and aid effectiveness
archive.ics.uci.edu/ml/: 333 (at time of writing) datasets of various lengths and widths for testing statistical learning algorithms
www.cdc.gov/nchs/data_access/ftp_data.htm: some health-related datasets made available by the US Centers for Disease Control
Exercises

Practice the following exercises to revise the concepts learned in this chapter:

How did we waste computation in the similarity_matrix function?
Both the Last.fm and the MusicBrainz APIs have a count value associated with each tag, which can be taken to represent the extent to which the tag applies to the artist. By ignoring this field, in both cases, we implicitly used a count of 1 for every tag—making well-fitting tags just as important as relatively less well-fitting ones. Rewrite the code to take count into account, and weigh each tag proportionally to its count value. This will be challenging, but it will be invaluable for understanding the material. It will also boost your confidence as an R programmer once you finish. Go you!
How else might you be able to extend and improve upon our ragtag recommender system?
The efficient market hypothesis posits that since the price of financial instruments reflects all the relevant information about their value at any given time, it is impossible to consistently beat the market. Familiarize yourself with the weak, semi-strong, and strong formulations of this hypothesis. Which, if any, of the camps do you align with? Why? Be specific.
Summary

This chapter began with a discussion of relational databases. You've learned that the DBI package defines a standard interface that various database drivers build upon. You then learned how to query these types of databases, and load the results into R.

Next, you gained an appreciation for JSON and XML (right?!), and how to approach the import of data from these formats. We then put our chops to the test by wielding data provided to us by two different web service APIs.

I stealthily snuck some fancy new R constructs into this chapter. For example, prior to this chapter, we've never explicitly worked with lists before.

Finally, you've learned how to look for information beyond what this chapter can provide, and some other places where we can get data to play around with.

In the next chapter, we won't be talking about how to load data from different sources—we'll be talking about how to deal with disorderly data that is already loaded.
Chapter 11. Dealing with Messy Data

As mentioned in the last chapter, analyzing data in the real world often requires some know-how outside of the typical introductory data analysis curriculum. For example, rarely do we get a neatly formatted, tidy dataset with no errors, junk, or missing values. Rather, we often get messy, unwieldy datasets.

What makes a dataset messy? Different people in different roles have different ideas about what constitutes messiness. Some regard any data that invalidates the assumptions of the parametric model as messy. Others see messiness in datasets with a grievously imbalanced number of observations in each category for a categorical variable. Some examples of things that I would consider messy are:

Many missing values (NAs)
Misspelled names in categorical variables
Inconsistent data coding
Numbers in the same column being in different units
Mis-recorded data and data entry mistakes
Extreme outliers

Since there are an infinite number of ways that data can be messy, there's simply no chance of enumerating every example and their respective solutions. Instead, we are going to talk about two tools that help combat the bulk of the messiness issues that I cited just now.
Analysis with missing data

Missing data is another one of those topics that are largely ignored in most introductory texts. Probably, part of the reason why this is the case is that many myths about analysis with missing data still abound. Additionally, some of the research into cutting-edge techniques is still relatively new. A more legitimate reason for its absence in introductory texts is that most of the more principled methodologies are fairly complicated—mathematically speaking. Nevertheless, the incredible ubiquity of problems related to missing data in real-life data analysis necessitates some broaching of the subject. This section serves as a gentle introduction to the subject and one of the more effective techniques for dealing with it.

A common refrain on the subject is something along the lines of the best way to deal with missing data is not to have any. It's true that missing data is a messy subject, and there are a lot of ways to do it wrong. It's important not to take this advice to the extreme, though. In order to bypass missing data problems, some have disallowed survey participants, for example, to go on without answering all the questions on a form. You can coerce the participants in a longitudinal study not to drop out, too. Don't do this. Not only is it unethical, it is also prodigiously counter-productive; there are treatments for missing data, but there are no treatments for bad data.

The standard treatment for the problem of missing data is to replace the missing data with non-missing values. This process is called imputation. In most cases, the goal of imputation is not to recreate the lost completed dataset but to allow valid statistical estimates or inferences to be drawn from incomplete data. Because of this, the effectiveness of different imputation techniques can't be evaluated by their ability to most accurately recreate the data from a simulated missing dataset; they must, instead, be judged by their ability to support the same statistical inferences as would be drawn from the analysis on the complete data. In this way, filling in the missing data is only a step towards the real goal—the analysis. The imputed dataset is rarely considered the final goal of imputation.

There are many different ways that missing data is dealt with in practice—some are good, some are not so good. Some are okay under certain circumstances, but not okay in others. Some involve missing data deletion, while some involve imputation. We will briefly touch on some of the more common methods. The ultimate goal of this chapter, though, is to get you started on what is often described as the gold standard of imputation techniques: multiple imputation.
Visualizing missing data

In order to demonstrate visualizing the patterns of missing data, we first have to create some missing data. This will also be the same dataset that we perform analysis on later in the chapter. To showcase how to use multiple imputation in a semi-realistic scenario, we are going to create a version of the mtcars dataset with a few missing values.

Okay, let's set the seed (for deterministic randomness), and create a variable to hold our new marred dataset.

set.seed(2)
miss_mtcars <- mtcars

First, we are going to create seven missing values in drat (about 20 percent), five missing values in the mpg column (about 15 percent), five missing values in the cyl column, three missing values in wt (about 10 percent), and three missing values in vs:

some_rows <- sample(1:nrow(miss_mtcars), 7)
miss_mtcars$drat[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 5)
miss_mtcars$mpg[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 5)
miss_mtcars$cyl[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 3)
miss_mtcars$wt[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 3)
miss_mtcars$vs[some_rows] <- NA

Now, we are going to create four missing values in qsec, but only for automatic cars:

only_automatic <- which(miss_mtcars$am == 0)
some_rows <- sample(only_automatic, 4)
miss_mtcars$qsec[some_rows] <- NA

Now, let's take a look at the dataset:
Now,let’stakealookatthedataset:
>miss_mtcars
mpgcyldisphpdratwtqsecvsamgearcarb
MazdaRX421.06160.01103.902.62016.460144
MazdaRX4Wag21.06160.01103.902.87517.020144
Datsun71022.84108.0933.85NA18.611141
Hornet4Drive21.46258.0110NA3.21519.441031
HornetSportabout18.78360.0175NA3.44017.020032
Valiant18.1NA225.0105NA3.460NA1031
Great,nowlet’svisualizethemissingness.
Thefirstwaywearegoingtovisualizethepatternofmissingdataisbyusingthe
md.patternfunctionfromthemicepackage(whichisalsothepackagethatweare
ultimatelygoingtouseforimputingourmissingdata).Ifyoudon’thavethepackage
already,installit.
>library(mice)
>md.pattern(miss_mtcars)
disphpamgearcarbwtvsqsecmpgcyldrat
12111111111110
4111111110111
2111111111011
3111111111101
3111110111111
2111111101111
1111111110102
1111111101012
1111111011012
2111111011102
1111111101003
0000033455727
A row-wise missing data pattern refers to the columns that are missing for each row. This function aggregates and counts the number of rows with the same missing data pattern. This function outputs a binary (0 and 1) matrix. Cells with a 1 represent non-missing data; 0s represent missing data. Since the rows are sorted in an increasing-amount-of-missingness order, the first row always refers to the missing data pattern containing the least amount of missing data.

In this case, the missing data pattern with the least amount of missing data is the pattern containing no missing data at all. Because of this, the first row has all 1s in the columns that are named after the columns in the miss_mtcars dataset. The left-most column is a count of the number of rows that display the missing data pattern, and the right-most column is a count of the number of missing data points in that pattern. The last row contains a count of the number of missing data points in each column.

As you can see, 12 of the rows contain no missing data. The next most common missing data pattern is the one missing just mpg; four rows fit this pattern. There are only six rows that contain more than one missing value. Only one of these rows contains more than two missing values (as shown in the second-to-last row).

As far as datasets with missing data go, this particular one doesn't contain much. It is not uncommon for some datasets to have more than 30 percent of their data missing. This dataset doesn't even hit 3 percent.

Now let's visualize the missing data pattern graphically using the VIM package. You will probably have to install this, too.

library(VIM)
aggr(miss_mtcars, numbers=TRUE)

Figure 11.1: The output of VIM's visual aggregation of missing data. The left plot shows the proportion of missing values for each column. The right plot depicts the prevalence of row-wise missing data patterns, like md.pattern

At a glance, this representation shows us, effortlessly, that the drat column accounts for the highest proportion of missingness, column-wise, followed by mpg, cyl, qsec, vs, and wt. The graphic on the right shows us information similar to that of the output of md.pattern. This representation, though, makes it easier to tell if there is some systematic pattern of missingness. The blue cells represent non-missing data, and the red cells represent missing data. The numbers on the right of the graphic represent the proportion of rows displaying that missing data pattern. 37.5 percent of the rows contain no missing data whatsoever.
Types of missing data

The VIM package allowed us to visualize the missing data patterns. A related term, the missing data mechanism, describes the process that determines each data point's likelihood of being missing. There are three main categories of missing data mechanisms: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Discrimination based on the missing data mechanism is crucial, since it informs us about the options for handling the missingness.

The first mechanism, MCAR, occurs when data's missingness is unrelated to the data. This would occur, for example, if rows were deleted from a database at random, or if a gust of wind took a random sample of a surveyor's survey forms off into the horizon. The mechanism that governs the missingness of drat, mpg, cyl, wt, and vs is MCAR, because we randomly selected elements to go missing. This mechanism, while being the easiest to work with, is seldom tenable in practice.

MNAR, on the other hand, occurs when a variable's missingness is related to the variable itself. For example, suppose the scale that weighed each car had a capacity of only 3,700 pounds, and because of this, the eight cars that weighed more than that were recorded as NA. This is a classic example of the MNAR mechanism—it is the weight of the observation itself that is the cause for its being missing. Another example would be if, during the course of a trial of an anti-depressant drug, participants who were not being helped by the drug became too depressed to continue with the trial. At the end of the trial, when all the participants' levels of depression are assessed and recorded, there would be missing values for participants whose reason for absence is related to their level of depression.

The last mechanism, missing at random, is somewhat unfortunately named. Contrary to what it may sound like, it means there is a systematic relationship between the missingness of an outcome variable and other observed variables, but not the outcome variable itself. This is probably best explained by the following example.

Suppose that in a survey, there is a question about income level that, in its wording, uses a particular colloquialism. Due to this, a large number of the participants in the survey whose native language is not English couldn't interpret the question, and left it blank. If the survey collected just the name, gender, and income, the missing data mechanism of the question on income would be MNAR. If, however, the questionnaire included a question that asked if the participant spoke English as a first language, then the mechanism would be MAR. The inclusion of the Is English your first language? variable means that the missingness of the income question can be completely accounted for. The reason for the moniker missing at random is that when you control for the relationship between the missing variable and the observed variable(s) it is related to (for example, What is your income? and Is English your first language? respectively), the data are missing at random.

As another example, there is a systematic relationship between the am and qsec variables in our simulated missing dataset: qsecs were missing only for automatic cars. But within the group of automatic cars, the qsec variable is missing at random. Therefore, qsec's mechanism is MAR; controlling for transmission type, qsec is missing at random. Bear in mind, though, if we removed am from our simulated dataset, qsec would become MNAR.

As mentioned earlier, MCAR is the easiest type to work with because of the complete absence of a systematic relationship in the data's missingness. Many unsophisticated techniques for handling missing data rest on the assumption that the data are MCAR. On the other hand, MNAR data is the hardest to work with, since the properties of the missing data that caused its missingness have to be understood quantifiably, and included in the imputation model. Though multiple imputation can handle MNAR mechanisms, the procedures involved become more complicated and far beyond the scope of this text. The MCAR and MAR mechanisms allow us not to worry about the properties and parameters of the missing data. For this reason, you may sometimes find MCAR or MAR missingness being referred to as ignorable missingness.

MAR data is not as hard to work with as MNAR data, but it is not as forgiving as MCAR. For this reason, though our simulated dataset contains MCAR and MAR components, the mechanism that describes the whole data is MAR—just one MAR mechanism makes the whole dataset MAR.
So which one is it?

You may have noticed that the place of a particular dataset in the missing data mechanism taxonomy is dependent on the variables that it includes. For example, we know that the mechanism behind qsec is MAR, but if the dataset did not include am, it would be MNAR. Since we are the ones that created the data, we know the procedure that resulted in qsec's missing values. If we weren't the ones creating the data—as happens in the real world—and the dataset did not contain the am column, we would just see a bunch of arbitrarily missing qsec values. This might lead us to believe that the data is MCAR. It isn't, though; just because the variable to which another variable's missingness is systematically related is non-observed, doesn't mean that it doesn't exist.

This raises a critical question: can we ever be sure that our data is not MNAR? The unfortunate answer is no. Since the data that we need to prove or disprove MNAR is ipso facto missing, the MNAR assumption can never be conclusively disconfirmed. It's our job, as critically thinking data analysts, to ask whether there is likely an MNAR mechanism or not.
Unsophisticated methods for dealing with missing data

Here we are going to look at various types of methods for dealing with missing data:

Complete case analysis

This method, also called list-wise deletion, is a straightforward procedure that simply removes all rows or elements containing missing values prior to the analysis. In the univariate case—taking the mean of the drat column, for example—all elements of drat that are missing would simply be removed:

> mean(miss_mtcars$drat)
[1] NA
> mean(miss_mtcars$drat, na.rm=TRUE)
[1] 3.63

In a multivariate procedure—for example, linear regression predicting mpg from am, wt, and qsec—all rows that have a missing value in any of the columns included in the regression are removed:

listwise_model <- lm(mpg ~ am + wt + qsec,
                     data=miss_mtcars,
                     na.action = na.omit)
## OR
# complete.cases returns a boolean vector
comp <- complete.cases(cbind(miss_mtcars$mpg,
                             miss_mtcars$am,
                             miss_mtcars$wt,
                             miss_mtcars$qsec))
comp_mtcars <- mtcars[comp,]
listwise_model <- lm(mpg ~ am + wt + qsec,
                     data=comp_mtcars)
Under an MCAR mechanism, a complete case analysis produces unbiased estimates of the mean, variance/standard deviation, and regression coefficients, which means that the estimates don't systematically differ from the true values on average, since the included data elements are just a random sampling of the recorded data elements. However, inference-wise, since we lost a number of our samples, we are going to lose statistical power and generate standard errors and confidence intervals that are bigger than they need to be. Additionally, in the multivariate regression case, note that our sample size depends on the variables that we include in the regression; the more variables, the more missing data we open ourselves up to, and the more rows we are liable to lose. This makes comparing results across different models slightly hairy.

Under an MAR or MNAR mechanism, list-wise deletion will produce biased estimates of the mean and variance. For example, if am were highly correlated with qsec, the fact that we are missing qsec only for automatic cars would significantly shift our estimates of the mean of qsec. Surprisingly, list-wise deletion produces unbiased estimates of the regression coefficients, even if the data is MNAR or MAR, as long as the relevant variables are included in the regression equations. For this reason, if there are relatively few missing values in a dataset that is to be used in regression analysis, list-wise deletion could be an acceptable alternative to more principled approaches.
Pairwise deletion

Also called available-case analysis, this technique is (somewhat unfortunately) common when estimating covariance or correlation matrices. For each pair of variables, it only uses the cases that are non-missing for both. This often means that the number of elements used will vary from cell to cell of the covariance/correlation matrices. This can result in absurd correlation coefficients that are above 1, making the resulting matrices largely useless to methodologies that depend on them.
Mean substitution

Mean substitution, as the name suggests, replaces all the missing values with the mean of the available cases. For example:

mean_sub <- miss_mtcars
mean_sub$qsec[is.na(mean_sub$qsec)] <- mean(mean_sub$qsec,
                                            na.rm=TRUE)
# etc…

Although this seemingly solves the problem of the loss of sample size in the list-wise deletion procedure, mean substitution has some very unsavory properties of its own. Whilst mean substitution produces unbiased estimates of the mean of a column, it produces biased estimates of the variance, since it removes the natural variability that would have occurred in the missing values had they not been missing. The variance estimates from mean substitution will therefore be, systematically, too small. Additionally, it's not hard to see that mean substitution will result in biased estimates if the data are MAR or MNAR. For these reasons, mean substitution is not recommended under virtually any circumstance.
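The variance shrinkage is easy to demonstrate with simulated data; this sketch uses a made-up vector, not miss_mtcars:

```r
set.seed(1)
x <- rnorm(100)                 # complete data
x_miss <- x
x_miss[sample(100, 20)] <- NA   # punch 20 MCAR holes in it

x_sub <- x_miss
x_sub[is.na(x_sub)] <- mean(x_sub, na.rm = TRUE)  # mean substitution

# The imputed points sit exactly at the mean and contribute zero
# deviation, so the variance estimate is systematically too small:
var(x_miss, na.rm = TRUE)  # variance of the observed cases
var(x_sub)                 # smaller
```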
Hot deck imputation

Hot deck imputation is an intuitively elegant approach that fills in the missing data with donor values from another row in the dataset. In the least sophisticated formulation, a random non-missing element from the same dataset is shared with a missing value. In more sophisticated hot deck approaches, the donor value comes from a row that is similar to the row with the missing data. The multiple imputation technique that we will be using in a later section of this chapter borrows this idea for one of its imputation methods.

Note

The term hot deck refers to the old practice of storing data in decks of punch cards. The deck that holds the donor value would be hot because it is the one that is currently being processed.
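A random hot deck (the least sophisticated formulation described above) can be sketched in a few lines of base R; random_hot_deck is a hypothetical helper name, not from any package:

```r
# Fill each NA in a vector with a donor value drawn at random
# from the non-missing elements of the same vector
random_hot_deck <- function(column) {
  missing <- is.na(column)
  donors  <- column[!missing]
  column[missing] <- sample(donors, sum(missing), replace = TRUE)
  column
}

set.seed(3)
qsec_filled <- random_hot_deck(c(16.5, NA, 17.0, NA, 19.4))
```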
Regression imputation

This approach attempts to fill in the missing data in a column using regression to predict likely values of the missing elements using other columns as predictors. For example, using regression imputation on the drat column would employ a linear regression predicting drat from all the other columns in miss_mtcars. The process would be repeated for all columns containing missing data, until the dataset is complete.

This procedure is intuitively appealing, because it integrates knowledge of the other variables and patterns of the dataset. This creates a set of more informed imputations. As a result, this produces unbiased estimates of the mean and regression coefficients under MCAR and MAR (so long as the relevant variables are included in the regression model).

However, this approach is not without its problems. The predicted values of the missing data lie right on the regression line but, as we know, very few data points lie right on the regression line—there is usually a normally distributed residual (error) term. Due to this, regression imputation underestimates the variability of the missing values. As a result, it will result in biased estimates of the variance and covariance between different columns. However, we're on the right track.
Stochastic regression imputation

As far as unsophisticated approaches go, stochastic regression is fairly evolved. This approach solves some of the issues of regression imputation, and produces unbiased estimates of the mean, variance, covariance, and regression coefficients under MCAR and MAR. It does this by adding a random (stochastic) value to the predictions of regression imputation. This random added value is sampled from the residual (error) distribution of the linear regression—which, if you remember, is assumed to be a normal distribution. This restores the variability in the missing values (that we lost in regression imputation) that those values would have had if they weren't missing.

However, as far as subsequent analysis and inference on the imputed dataset goes, stochastic regression results in standard errors and confidence intervals that are smaller than they should be. Since it produces only one imputed dataset, it does not capture the extent to which we are uncertain about the residuals and our coefficient estimates.
Nevertheless, stochastic regression forms the basis of still more sophisticated imputation methods.

There are two sophisticated, well-founded, and recommended methods of dealing with missing data. One is called the Expectation Maximization (EM) method, which we do not cover here. The second is called Multiple Imputation, and because it is widely considered the most effective method, it is the one we explore in this chapter.
Multiple imputation
The big idea behind multiple imputation is that instead of generating one set of imputed
data with our best estimation of the missing data, we generate multiple versions of the
imputed data where the imputed values are drawn from a distribution. The uncertainty
about what the imputed values should be is reflected in the variation between the multiply
imputed datasets.
We perform our intended analysis separately with each of these m completed datasets.
These analyses will then yield m different parameter estimates (like regression
coefficients, and so on). The critical point is that these parameter estimates are different
solely due to the variability in the imputed missing values, and hence, our uncertainty
about what the imputed values should be. This is how multiple imputation integrates
uncertainty, and outperforms more limited imputation methods that produce one imputed
dataset, conferring an unwarranted sense of confidence in the filled-in data of our analysis.
The following diagram illustrates this idea:
Figure 11.2: Multiple imputation in a nutshell
So how does mice come up with the imputed values?
Let's focus on the univariate case, where only one column contains missing data and we
use all the other (completed) columns to impute the missing values, before generalizing
to a multivariate case.
mice actually has a few different imputation methods up its sleeve, each best suited for a
particular use case. mice will often choose sensible defaults based on the data type
(continuous, binary, non-binary categorical, and so on).
The most important method is what the package calls the norm method. This method is
very much like stochastic regression. Each of the m imputations is created by adding a
normal "noise" term to the output of a linear regression predicting the missing variable.
What makes this slightly different than just stochastic regression repeated m times is that
the norm method also integrates uncertainty about the regression coefficients used in the
predictive linear model.
Recall that the regression coefficients in a linear regression are just estimates of the
population coefficients from a random sample (that's why each regression coefficient has
a standard error and confidence interval). Another sample from the population would have
yielded slightly different coefficient estimates. If, through all our imputations, we just
added a normal residual term from a linear regression equation with the same coefficients,
we would be systematically understating our uncertainty regarding what the imputed
values should be.
To combat this, in multiple imputation, each imputation of the data contains two steps.
The first step performs stochastic linear regression imputation using coefficients for each
predictor estimated from the data. The second step chooses slightly different estimates of
these regression coefficients, and proceeds into the next imputation. The first step of the
next imputation uses the slightly different coefficient estimates to perform stochastic
linear regression imputation again. After that, in the second step of the second iteration,
still other coefficient estimates are generated to be used in the third imputation. This cycle
goes on until we have m multiply imputed datasets.
How do we choose these different coefficient estimates at the second step of each
imputation? Traditionally, the approach is Bayesian in nature; these new coefficients are
drawn from each of the coefficients' posterior distribution, which describes credible values
of the estimate using the observed data and uninformative priors. This is the approach that
norm uses. There is an alternate method that chooses these new coefficient estimates from
a sampling distribution that is created by taking repeated samples of the data (with
replacement) and estimating the regression coefficients of each of these samples. mice
calls this method norm.boot.
The multivariate case is a little more hairy, since the imputation for one column depends
on the other columns, which may contain missing data of their own.
For this reason, we make several passes over all the columns that need imputing, until the
imputation of all missing data in a particular column is informed by informed estimates of
the missing data in the predictor columns. These passes over all the columns are called
iterations.
So that you really understand how this iteration works, let's say we are performing
multiple imputation on a subset of miss_mtcars containing only mpg, wt, and drat. First,
all the missing data in all the columns are set to a placeholder value like the mean or a
randomly sampled non-missing value from its column. Then, we visit mpg where the
placeholder values are turned back into missing values. These missing values are predicted
using the two-part procedure described in the univariate case. Then we move on to wt; the
placeholder values are turned back into missing values, whose new values are imputed
with the two-step univariate procedure using mpg and drat as predictors. Then this is
repeated with drat. This is one iteration. On the next iteration, it is not the placeholder
values that get turned back into random values and imputed but the imputed values from
the previous iteration. As this repeats, we shift away from the starting values and the
imputed values begin to stabilize. This usually happens within just a few iterations. The
dataset at the completion of the last iteration is the first multiply imputed dataset. Each m
starts the iteration process all over again.
The default in mice is five iterations. Of course, you can increase this number if you have
reason to believe that you need to. We'll discuss how to tell if this is necessary later in the
section.
Methods of imputation
The method of imputation that we described for the univariate case, norm, works best for
imputed values that follow an unconstrained normal distribution, but it could lead to
some nonsensical imputations otherwise. For example, since the weights in wt are so close
to 0 (because it's in units of a thousand pounds), it is possible for the norm method to
impute a negative weight. Though this will no doubt balance out over the other m-1
multiply imputed datasets, we can combat this situation by using another method of
imputation called predictive mean matching.
Predictive mean matching (mice calls this pmm) works a lot like norm. The difference is that
the norm imputations are then used to find the d closest values to the imputed value among
the non-missing data in the column. Then, one of these d values is chosen as the final
imputed value; d=3 is the default in mice.
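To make the matching step concrete, here is a toy function (not part of mice, and not its actual implementation) illustrating the idea for a single value:

```r
# Toy illustration of the matching step in pmm: find the d observed values
# closest to the regression-imputed value, then randomly pick one of them
# as the final imputation
pmm_match <- function(imputed_value, observed, d = 3) {
  donors <- observed[order(abs(observed - imputed_value))][1:d]
  sample(donors, 1)
}

# a regression-imputed weight of 1.4 could only be replaced by one of
# {1.513, 1.615, 1.835}, the three lowest observed weights in mtcars
pmm_match(1.4, mtcars$wt)
```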
This method has a few great properties. For one, the possibility of imputing a negative
value for wt is categorically off the table; the imputed value would have to be chosen from
the set {1.513, 1.615, 1.835}, since these are the three lowest weights. More generally, any
natural constraint in the data (lower or upper bounds, integer count data, numbers rounded
to the nearest one-half, and so on) is respected with predictive mean matching, because the
imputed values appear in the actual non-missing observed values. In this way, predictive
mean matching is like hot-deck imputation. Predictive mean matching is the default
imputation method in mice for numerical data, though it may be inferior to norm for small
datasets and/or datasets with a lot of missing values.
Many of the other imputation methods in mice are specially suited for one particular data
type. For example, binary categorical variables use logreg by default; this is like norm but
uses logistic regression to impute a binary outcome. Similarly, non-binary categorical data
uses multinomial regression; mice calls this method polyreg.
Multiple imputation in practice
There are a few steps to follow and decisions to make when using this powerful
imputation technique:
1. Are the data MAR? And be honest! If the mechanism is likely not MAR, then more
   complicated measures have to be taken.
2. Are there any derived terms, redundant variables, or irrelevant variables in the data
   set? Any of these types of variables will interfere with the regression process.
   Irrelevant variables, like unique IDs, will not have any predictive power. Derived
   terms or redundant variables, like having a column for weight in pounds and grams,
   or a column for area in addition to a length and width column, will similarly
   interfere with the regression step.
3. Convert all categorical variables to factors, otherwise mice will not be able to tell
   that the variable is supposed to be categorical.
4. Choose the number of iterations and m. By default, these are both five. Using five
   iterations is usually okay, and we'll be able to tell if we need more. Five imputations
   are usually okay, too, but we can achieve more statistical power from more imputed
   datasets. I suggest setting m to 20, unless the processing power and time can't be
   spared.
5. Choose an imputation method for each variable. You can stick with the defaults as
   long as you are aware of what they are and think they're the right fit.
6. Choose the predictors. Let mice use all the available columns as predictors as
   long as derived terms and redundant/irrelevant columns are removed. Not only
   does using more predictors result in reduced bias, but it also increases the
   likelihood that the data is MAR.
7. Perform the imputations.
8. Audit the imputations.
9. Perform analysis with the imputations.
10. Pool the results of the analyses.
Before we get down to it, let's call the mice function on our data frame with missing data,
and use its default arguments, just to see what we shouldn't do and why:

# we are going to set the seed and printFlag to FALSE, but
# everything else will be the default argument
imp <- mice(miss_mtcars, seed=3, printFlag=FALSE)
print(imp)
-----------------------------
Multiply imputed data set
Call:
mice(data = miss_mtcars, printFlag = FALSE, seed = 3)
Number of multiple imputations:  5
Missing cells per column:
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
   5    5    0    0    7    3    4    3    0    0    0
Imputation methods:
  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
"pmm" "pmm"    ""    "" "pmm" "pmm" "pmm" "pmm"    ""    ""    ""
VisitSequence:
 mpg  cyl drat   wt qsec   vs
   1    2    5    6    7    8
PredictorMatrix:
     mpg cyl disp hp drat wt qsec vs am gear carb
mpg    0   1    1  1    1  1    1  1  1    1    1
cyl    1   0    1  1    1  1    1  1  1    1    1
disp   0   0    0  0    0  0    0  0  0    0    0
...
Random generator seed value:  3
The first thing we notice (on line four of the output) is that mice chose to create five
multiply imputed datasets, by default. As we discussed, this isn't a bad default, but more
imputations can only improve our statistical power (if only marginally); when we impute
this dataset in earnest, we will use m=20.
The second thing we notice (on lines 8-10 of the output) is that it used predictive mean
matching as the imputation method for all the columns with missing data. If you recall,
predictive mean matching is the default imputation method for numeric columns.
However, vs and cyl are binary categorical and non-binary categorical variables,
respectively. Because we didn't convert them to factors, mice thinks these are just regular
numeric columns. We'll have to fix this.
The last thing we should notice here is the predictor matrix (starting on line 14). Each row
and column of the predictor matrix refers to a column in the dataset to impute. If a cell
contains a 1, it means that the variable referred to in the column is used as a predictor for
the variable in the row. The first row indicates that all available attributes are used to help
predict mpg, with the exception of mpg itself. All the values in the diagonal are 0, because
mice won't use an attribute to predict itself. Note that the disp, hp, am, gear, and carb
rows all contain 0s; this is because these variables are complete, and don't need to use
any predictors.
Since we thought carefully about whether there were any attributes that should be
removed before we perform the imputation, we can use mice's default predictor matrix for
this dataset. If there were any non-predictive attributes (like unique identifiers, redundant
variables, and so on) we would have either had to remove them (easiest option), or instruct
mice not to use them as predictors (harder).
Let's now correct the issues that we've discussed.

# convert categorical variables into factors
miss_mtcars$vs <- factor(miss_mtcars$vs)
miss_mtcars$cyl <- factor(miss_mtcars$cyl)

imp <- mice(miss_mtcars, m=20, seed=3, printFlag=FALSE)
imp$method
------------------------------------
      mpg       cyl      disp        hp      drat
    "pmm" "polyreg"        ""        ""     "pmm"
       wt      qsec        vs        am      gear
    "pmm"     "pmm"  "logreg"        ""        ""
     carb
       ""

Now mice has corrected the imputation method of cyl and vs to their correct defaults. In
truth, cyl is a kind of discrete numeric variable called an ordinal variable, which means
that yet another imputation method may be optimal for that attribute, but, for the sake of
simplicity, we'll treat it as a categorical variable.
Before we get to use the imputations in an analysis, we have to check the output. The first
thing we need to check is the convergence of the iterations. Recall that for imputing data
with missing values in multiple columns, multiple imputation requires iterating over all
these columns a few times. At each iteration, mice produces imputations, and samples
new parameter estimates from the parameters' posterior distributions, for all columns that
need to be imputed. The final imputations, for each multiply imputed dataset m, are the
imputed values from the final iteration.
In contrast to when we used MCMC in Chapter 7, Bayesian Methods, the convergence in
mice is much faster; it usually occurs in just a few iterations. However, as in Chapter 7,
Bayesian Methods, visually checking for convergence is highly recommended. We even
check for it similarly; when we call the plot function on the variable that we assign the
mice output to, it displays trace plots of the mean and standard deviation of all the
variables involved in the imputations. Each line in each plot is one of the m imputations.

plot(imp)

Figure 11.3: A subset of the trace plots produced by plotting an object returned by a mice
imputation
As you can see from the preceding trace plot on imp, there are no clear trends and the
variables are all overlapping from one iteration to the next. Put another way, the variance
within a chain (there are m chains) should be about equal to the variance between the
chains. This indicates that convergence was achieved.
If convergence was not achieved, you can increase the number of iterations that mice
employs by explicitly specifying the maxit parameter to the mice function.
Note
To see an example of non-convergence, take a look at Figures 7 and 8 in the paper that
describes this package, written by the authors of the package themselves. It is available at
http://www.jstatsoft.org/article/view/v045i03.
The next step is to make sure the imputed values are reasonable. In general, whenever we
quickly review the results of something to see if they make sense, it is called a sanity test
or sanity check. With the following line, we're going to display the imputed values for the
five missing mpgs for the first six imputations:

imp$imp$mpg[,1:6]
------------------------------------
                      1    2    3    4    5    6
Duster 360         19.2 16.4 17.3 15.5 15.0 19.2
Cadillac Fleetwood 15.2 13.3 15.0 13.3 10.4 17.3
Chrysler Imperial  10.4 15.0 15.0 16.4 10.4 10.4
Porsche 914-2      27.3 22.8 21.4 22.8 21.4 15.5
Ferrari Dino       19.2 21.4 19.2 15.2 18.1 19.2
These sure look reasonable. A better method for sanity checking is to call densityplot on
the variable that we assign the mice output to:

densityplot(imp)

Figure 11.4: Density plots of all the imputed values for mpg, drat, wt, and qsec. Each
imputation has its own density curve in each quadrant
This displays, for every attribute imputed, a density plot of the actual non-missing values
(the thick line) and the imputed values (the thin lines). We are looking to see that the
distributions are similar. Note that the density curves of the imputed values extend much
higher than the observed values' density curve in this case. This is partly because we
imputed so few variables that there weren't enough data points to properly smooth the
density approximation. Height and non-smoothness notwithstanding, these density plots
indicate no outlandish behavior among the imputed variables.
We are now ready for the analysis phase. We are going to perform linear regression on
each imputed dataset and attempt to model mpg as a function of am, wt, and qsec. Instead
of repeating the analyses on each dataset manually, we can apply an expression to all the
datasets at one time with the with function, as follows:

imp_models <- with(imp, lm(mpg ~ am + wt + qsec))

We could take a peek at the estimated coefficients from each dataset using lapply on the
analyses attribute of the returned object:

lapply(imp_models$analyses, coef)
--------------------------------
[[1]]
(Intercept)          am          wt        qsec
 18.1534095   2.0284014  -4.4054825   0.8637856

[[2]]
(Intercept)          am          wt        qsec
   8.375455    3.336896   -3.520882    1.219775

[[3]]
(Intercept)          am          wt        qsec
   5.254578    3.277198   -3.233096    1.337469
......
Finally, let's pool the results of the analyses (with the pool function), and call summary on
it:

pooled_model <- pool(imp_models)
summary(pooled_model)
---------------------------------
                  est        se         t       df    Pr(>|t|)
(Intercept)  7.049781 9.2254581  0.764166 17.63319 0.454873254
am           3.182049 1.7445444  1.824000 21.36600 0.082171407
wt          -3.413534 0.9983207 -3.419276 14.99816 0.003804876
qsec         1.270712 0.3660131  3.471765 19.93296 0.002416595
                  lo 95     hi 95 nmis       fmi    lambda
(Intercept) -12.3611281 26.460690   NA 0.3459197 0.2757138
am           -0.4421495  6.806247    0 0.2290359 0.1600952
wt           -5.5414268 -1.285641    3 0.4324828 0.3615349
qsec          0.5070570  2.034366    4 0.2736026 0.2042003
Though we could have performed the pooling ourselves using the equations that Donald
Rubin outlined in his 1987 classic Multiple Imputation for Nonresponse in Surveys, it is
less of a hassle and less error-prone to have the pool function do it for us. Readers who are
interested in the pooling rules are encouraged to consult the aforementioned text.
As you can see, for each parameter, pool has combined the coefficient estimates and
standard errors, and calculated the appropriate degrees of freedom. These allow us to t-test
each coefficient against the null hypothesis that the coefficient is equal to 0, produce
p-values for the t-tests, and construct confidence intervals.
The standard errors and confidence intervals are wider than those that would have resulted
from linear regression on a single imputed dataset, but that's because it is appropriately
taking into account our uncertainty regarding what the missing values would have been.
There are, at present time, a limited number of analyses that can be automatically pooled
by mice, the most important being lm/glm. If you recall, though, the generalized linear
model is extremely flexible, and can be used to express a wide array of different analyses.
By extension, we could use multiple imputation for not only linear regression but logistic
regression, Poisson regression, t-tests, ANOVA, ANCOVA, and more.
Analysis with unsanitized data
Very often, there will be errors or mistakes in data that can severely complicate analyses,
especially with public data or data outside of your organization. For example, say there is
a stray comma or punctuation mark in a column that was supposed to be numeric. If we
aren't careful, R will read this column as character, and subsequent analysis may, in the
best case scenario, fail; it is also possible, however, that our analysis will silently chug
along, and return an unexpected result. This will happen, for example, if we try to perform
linear regression using the punctuation-containing-but-otherwise-numeric column as a
predictor, which will compel R to convert it into a factor, thinking that it is a categorical
variable.
In the worst-case scenario, an analysis with unsanitized data may not error out or return
nonsensical results, but return results that look plausible but are actually incorrect. For
example, it is common (for some reason) to encode missing data with 999 instead of NA;
performing a regression analysis with 999 in a numeric column can severely adulterate our
linear models, but often not enough to cause clearly inappropriate results. This mistake
may then go undetected indefinitely.
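A tiny made-up example shows how quietly this sentinel value can distort an estimate:

```r
# made-up income data (in thousands) where 999 is a missing-value sentinel
income <- c(32, 41, 28, 999, 37)
mean(income)                  # 227.4, wildly inflated by the sentinel

income[income == 999] <- NA   # recode the sentinel properly
mean(income, na.rm = TRUE)    # 34.5, the intended estimate
```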
Some problems like these could, rather easily, be detected in small datasets by visually
auditing the data. Often, however, mistakes like these are notoriously easy to miss.
Further, visual inspection is an untenable solution for datasets with thousands of rows and
hundreds of columns. Any sustainable solution must off-load this auditing process to R.
But how do we describe aberrant behavior to R so that it can catch mistakes on its own?
The package assertr seeks to do this by introducing a number of data checking verbs.
Using assertr grammar, these verbs (functions) can be combined with subjects (data) in
different ways to express a rich vocabulary of data validation tasks.
More prosaically, assertr provides a suite of functions designed to verify assumptions
about data early in the analysis process, before any time is wasted computing on bad data.
The idea is to provide as much information as you can about how you expect the data to
look up front so that any deviation from this expectation can be dealt with immediately.
Given that the assertr grammar is designed to be able to describe a bouquet of error-checking
routines, rather than list all the functions and functionalities that the package
provides, it would be more helpful to visit particular use cases.
Two things before we start. First, make sure you install assertr. Second, bear in mind
that all data verification verbs in assertr take a data frame to check as their first
argument, and either (a) return the same data frame if the check passes, or (b) produce a
fatal error. Since the verbs return a copy of the chosen data frame if the check passes, the
main idiom in assertr involves reassignment of the returned data frame after it passes
the check:
a_dataset <- CHECKING_VERB(a_dataset, ....)
Checking for out-of-bounds data
It's common for numeric values in a column to have a natural constraint on the values that
they should hold. For example, if a column represents a percent of something, we might want
to check if all the values in that column are between 0 and 1 (or 0 and 100). In assertr,
we typically use the within_bounds function in conjunction with the assert verb to
ensure that this is the case. For example, here we add a column to mtcars that represents
each car's weight as a percent of the heaviest car's weight, and then check that it is in bounds:

library(assertr)
mtcars.copy <- mtcars

mtcars.copy$Percent.Max.Wt <- round(mtcars.copy$wt /
                                    max(mtcars.copy$wt),
                                    2)
mtcars.copy <- assert(mtcars.copy, within_bounds(0,1),
                      Percent.Max.Wt)
within_bounds is actually a function that takes the lower and upper bounds and returns a
predicate, a function that returns TRUE or FALSE. The assert function then applies this
predicate to every element of the column specified in the third argument. If there are more
than three arguments, assert will assume there are more columns to check.
Using within_bounds, we can also avoid the situation where NA values are specified as
"999", as long as the second argument in within_bounds is less than this value.
within_bounds can take other information such as whether the bounds should be inclusive
or exclusive, or whether it should ignore the NA values. To see the options for this, and all
the other functions in assertr, use the help function on them.
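Note, too, that assert is not limited to predicates that ship with assertr; any function that maps each element to TRUE or FALSE can serve. As a sketch, here is a hypothetical user-defined predicate checking that certain columns hold whole numbers:

```r
# a hypothetical user-defined predicate: is each element a whole number?
is_whole_number <- function(x) x == round(x)

# cyl, gear, and carb are counts, so this check should pass
mtcars.copy <- assert(mtcars.copy, is_whole_number, cyl, gear, carb)
```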
Let's see an example of what it looks like when the assert function fails:

mtcars.copy$Percent.Max.Wt[c(10,15)] <- 2
mtcars.copy <- assert(mtcars.copy, within_bounds(0,1),
                      Percent.Max.Wt)
-----------------------------------------------------------
Error:
Vector 'Percent.Max.Wt' violates assertion 'within_bounds' 2 times (e.g.
[2] at index 10)

We get an informative error message that tells us how many times the assertion was
violated, and the index and value of the first offending datum.
With assert, we have the option of checking a condition on multiple columns at the same
time. For example, none of the measurements in iris can possibly be negative. Here's
how we might make sure our dataset is compliant:

iris <- assert(iris, within_bounds(0, Inf),
               Sepal.Length, Sepal.Width,
               Petal.Length, Petal.Width)

# or simply "-Species" because that
# will include all columns *except* Species
iris <- assert(iris, within_bounds(0, Inf),
               -Species)
On occasion, we will want to check elements for adherence to a more complicated pattern.
For example, let's say we had a column that we knew was either between -10 and -20, or
10 and 20. We can check for this by using the more flexible verify verb, which takes a
logical expression as its second argument; if any of the results in the logical expression is
FALSE, verify will cause an error.

vec <- runif(10, min=10, max=20)
# randomly turn some elements negative
vec <- vec * sample(c(1, -1), 10,
                    replace=TRUE)
example <- data.frame(weird=vec)

example <- verify(example, ((weird < 20 & weird > 10) |
                            (weird < -10 & weird > -20)))
# or
example <- verify(example, abs(weird) < 20 & abs(weird) > 10)
# passes

example$weird[4] <- 0
example <- verify(example, abs(weird) < 20 & abs(weird) > 10)
# fails
------------------------------------
Error in verify(example, abs(weird) < 20 & abs(weird) > 10):
verification failed! (1 failure)
Checking the data type of a column
By default, most of the data import functions in R will attempt to guess the data type for
each column at the import phase. This is usually nice, because it saves us from tedious
work. However, it can backfire when there are, for example, stray punctuation marks in
what are supposed to be numeric columns. To verify this, we can use the assert function
with the is.numeric base function:

iris <- assert(iris, is.numeric, -Species)

We can use the is.character and is.logical functions with assert, too.
An alternative method that will disallow the import of unexpected data types is to specify
the data type that each column should be at the data import phase with the colClasses
optional argument:

iris <- read.csv("PATH_TO_IRIS_DATA.csv",
                 colClasses=c("numeric", "numeric",
                              "numeric", "numeric",
                              "character"))

This solution comes with the added benefit of speeding up the data import process, since
R doesn't have to waste time guessing each column's data type.
Checking for unexpected categories
Another data integrity impropriety that is, unfortunately, very common is the mislabeling
of categorical variables. There are two types of mislabeling of categories that can occur:
an observation's class is mis-entered/mis-recorded/mistaken for that of another class, or
the observation's class is labeled in a way that is not consistent with the rest of the labels.
To see an example of what we can do to combat the former case, read assertr's vignette.
The latter case covers instances where, for example, the species of iris could be misspelled
(such as "versicolour", "verginica") or cases where the pattern established by the majority
of class names is ignored ("iris setosa", "i.setosa", "SETOSA"). Either way, these
misspecifications prove to be a great bane to data analysts for several reasons. For
example, an analysis that is predicated upon a two-class categorical variable (for example,
logistic regression) will now have to contend with more than two categories. Yet another
way in which unexpected categories can haunt you is by producing statistics grouped by
different values of a categorical variable; if the categories were extracted from the main
data manually (with subset, for example, as opposed to with by, tapply, or aggregate),
you'll be missing potentially crucial observations.
If you know what categories you are expecting from the start, you can use the in_set
function, in concert with assert, to confirm that all the categories of a particular column
are squarely contained within a predetermined set.
# passes
iris <- assert(iris, in_set("setosa", "versicolor",
                            "virginica"), Species)

# mess up the data
iris.copy <- iris
# we have to make the 'Species' column not
# a factor
iris.copy$Species <- as.vector(iris$Species)
iris.copy$Species[4:9] <- "SETOSA"
iris.copy$Species[135] <- "verginica"
iris.copy$Species[95] <- "i.versicolor"

# fails
iris.copy <- assert(iris.copy, in_set("setosa", "versicolor",
                                      "virginica"), Species)
------------------------------------------
Error:
Vector 'Species' violates assertion 'in_set' 8 times (e.g. [SETOSA] at
index 4)
If you don't know the categories that you should be expecting, a priori, the following
incantation, which will tell you how many rows each category contains, may help you
identify the categories that are either rare or misspecified:

by(iris.copy, iris.copy$Species, nrow)
Checking for outliers, entry errors, or unlikely data points
Automatic outlier detection (sometimes known as anomaly detection) is something that a
lot of analysts scoff at and view as a pipe dream. Though the creation of a routine that
automagically detects all erroneous data points with 100 percent specificity and precision
is impossible, unmistakably mis-entered data points and flagrant outliers are not hard to
detect even with very simple methods. In my experience, there are a lot of errors of this
type.
One simple way to detect the presence of a major outlier is to confirm that every data
point is within some n number of standard deviations away from the mean of the group.
assertr has a function, within_n_sds, to do just this, in conjunction with the insist verb;
if we wanted to check that every numeric value in iris is within five standard
deviations of its respective column's mean, we could express so thusly:

iris <- insist(iris, within_n_sds(5), -Species)
An issue with using standard deviations away from the mean (z-scores) for detecting
outliers is that both the mean and standard deviation are influenced heavily by outliers;
this means that the very thing we are trying to detect is obstructing our ability to find it.
There is a more robust measure for finding central tendency and dispersion than the mean
and standard deviation: the median and median absolute deviation. The median absolute
deviation is the median of the absolute value of all the elements of a vector subtracted by
the vector's median.
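The definition translates directly into R. One caveat: R's built-in mad function scales the result by a constant (about 1.4826) by default so that it estimates the standard deviation under normality; passing constant = 1 recovers the raw definition:

```r
x <- iris$Petal.Length

# the median absolute deviation, straight from the definition
median(abs(x - median(x)))

# the built-in, with the default scaling constant turned off, agrees
mad(x, constant = 1)
```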
assertr has a sister to within_n_sds, within_n_mads, that checks every element of a
vector to make sure it is within n median absolute deviations away from its column's
median.

iris <- insist(iris, within_n_mads(4), -Species)
iris$Petal.Length[5] <- 15
iris <- insist(iris, within_n_mads(4), -Species)
--------------------------------------------
Error:
Vector 'Petal.Length' violates assertion 'within_n_mads' 1 time (value [15]
at index 5)

In my experience, within_n_mads can be an effective guard against illegitimate univariate
outliers if n is chosen carefully.
The examples here have been focusing on outlier identification in the univariate case,
across one dimension at a time. Often, there are times where an observation is truly
anomalous but it wouldn't be evident by looking at the spread of each dimension
individually. assertr has support for this type of multivariate outlier analysis, but a full
discussion of it would require a background outside the scope of this text.
Chaining assertions
assertr aims to make the checking of assumptions so effortless that the user
never feels the need to hold back any implicit assumption. Therefore, it's expected that the
user uses multiple checks on one data frame.
The usage examples that we've seen so far are really only appropriate for one or two
checks. For example, a usage pattern such as the following is clearly unworkable:

iris <-
CHECKING_CONSTRUCT4(CHECKING_CONSTRUCT3(CHECKING_CONSTRUCT2(CHECKING_CONSTRUCT1(this, ...), ...), ...), ...)

To combat this visual cacophony, assertr provides direct support for chaining multiple
assertions by using the "piping" construct from the magrittr package.
The pipe operator of magrittr, %>%, works as follows: it takes the item on the left-hand
side of the pipe and inserts it (by default) into the position of the first argument of the
function on the right-hand side. The following are some examples of simple magrittr
usage patterns:
library(magrittr)

4 %>% sqrt          # 2
iris %>% head(n=3)  # the first 3 rows of iris
iris <- iris %>% assert(within_bounds(0, Inf), -Species)

Since the return value of a passed assertr check is the validated data frame, you can use
the magrittr pipe operator to tack on more checks in a way that lends itself to easier
human understanding. For example:

iris <- iris %>%
  assert(is.numeric, -Species) %>%
  assert(within_bounds(0, Inf), -Species) %>%
  assert(in_set("setosa", "versicolor", "virginica"), Species) %>%
  insist(within_n_mads(4), -Species)
# or, equivalently
CHECKS <- . %>%
  assert(is.numeric, -Species) %>%
  assert(within_bounds(0, Inf), -Species) %>%
  assert(in_set("setosa", "versicolor", "virginica"), Species) %>%
  insist(within_n_mads(4), -Species)

iris <- iris %>% CHECKS

When chaining assertions, I like to put the most integral and general ones right at the top. I
also like to put the assertions most likely to be violated right at the top so that execution is
terminated before any more checks are run.
There are many other capabilities built into assertr, including the multivariate outlier
checking mentioned above. For more information about these, read the package's vignette
(vignette("assertr")).
On the magrittr side, besides the forward-pipe operator, this package sports some other
very helpful pipe operators. Additionally, magrittr allows the substitution at the right side
of the pipe operator to occur at locations other than the first argument. For more
information about the wonderful magrittr package, read its vignette.
Other messiness
As we discussed in this chapter's preface, there are countless ways that a dataset may be
messy. There are many other messy situations and solutions that we couldn't discuss at
length here. In order that you, dear reader, are not left in the dark regarding custodial
solutions, here are some other remedies which you may find helpful along your analytics
journey:
OpenRefine
Though OpenRefine (formerly Google Refine) doesn't have anything to do with R per se,
it is a sophisticated tool for working with and for cleaning up messy data. Among its
numerous, sophisticated capabilities is the capacity to auto-detect misspelled or
misspecified categories and fix them at the click of a button.
Regular expressions
Suppose you find that there are commas separating every third digit of the numbers in a
numeric column. How would you remove them? Or suppose you needed to strip a
currency symbol from values in columns that hold monetary values so that you can
compute with them as numbers. These, and vastly more complicated text transformations,
can be performed using regular expressions (a formal grammar for specifying search
patterns in text) and associated R functions like grep and sub. Any time spent learning
regular expressions will pay enormous dividends over your career as an analyst, and there
are many great, free tutorials available on the web for this purpose.
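As a taste, both transformations mentioned above can be handled with one-line gsub calls (the values here are made up for illustration):

```r
# strip a currency symbol and thousands separators, then parse as numeric
prices <- c("$1,200", "$950", "$12,450")
as.numeric(gsub("[$,]", "", prices))   # 1200 950 12450

# remove only the commas separating every third digit
counts <- c("1,024,768", "2,048")
as.numeric(gsub(",", "", counts))      # 1024768 2048
```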
tidyr
There are a few different ways in which you can represent the same tabular dataset. In one
form, called long, narrow, stacked, or entity-attribute-value model, each row contains
an observation ID, a variable name, and the value of that variable. For example:

           member  attribute  value
1     Ringo Starr  birthyear   1940
2  Paul McCartney  birthyear   1942
3 George Harrison  birthyear   1943
4     John Lennon  birthyear   1940
5     Ringo Starr  instrument  Drums
6  Paul McCartney  instrument  Bass
7 George Harrison  instrument  Guitar
8     John Lennon  instrument  Guitar

In another form (called wide or unstacked), each of the observation's variables are stored
in each column:

           member  birthyear  instrument
1 George Harrison       1943      Guitar
2     John Lennon       1940      Guitar
3  Paul McCartney       1942        Bass
4     Ringo Starr       1940       Drums

If you ever need to convert between these representations (which is a somewhat common
operation, in practice), tidyr is your tool for the job.
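For example, with the function names current in tidyr at the time of writing, converting between the two forms might look like the following sketch; beatles_long is an assumed data frame holding the member/attribute/value columns shown above:

```r
library(tidyr)

# long -> wide: one column per attribute
beatles_wide <- spread(beatles_long, attribute, value)

# wide -> long: gather every column except 'member' back into
# attribute/value pairs
beatles_long_again <- gather(beatles_wide, attribute, value, -member)
```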
Exercises
The following are a few exercises for you to strengthen your grasp over the concepts
learned in this chapter:
- Normally, when there is missing data for a question such as "What is your income?",
  we strongly suspect an MNAR mechanism, because we live in a dystopia that equates
  wealth with worth. As a result, the participants with the lowest income may be
  embarrassed to answer that question. In the relevant section, we assumed that
  because the question was poorly worded and we could account for whether English
  was the first language of the participant, the mechanism is MAR. If we were wrong
  about this reason, and it was really because the lower income participants were
  reticent to admit their income, what would the missing data mechanism be now? If,
  however, the differences in income were fully explained by whether English was the
  first language of the participant, what would the missing data mechanism be in that
  case?
- Find a dataset on the web with missing data. What does it use to denote that data is
  missing? Think about that dataset's missing data mechanism. Is there a chance that
  this data is MNAR?
- Find a freely available government dataset on the web. Read the dataset's description,
  and think about what assumptions you might make about the data when planning a
  certain analysis. Translate these into actual code so that R can check them for you.
  Were there any deviations from your expectations?
- When two autonomous individuals decide to voluntarily trade, the transaction can be
  in both parties' best interests. Does it necessarily follow that a voluntary trade
  between nations benefits both states? Why or why not?
Summary
"Messy data", no matter what definition you use, presents a huge roadblock for people who work with data. This chapter focused on two of the most notorious and prolific culprits: missing data and data that has not been cleaned or audited for quality.
On the missing data side, you learned how to visualize missing data patterns, and how to recognize different types of missing data. You saw a few unprincipled ways of tackling the problem, and learned why they were suboptimal solutions. Multiple imputation, as you learned, addresses the shortcomings of these approaches and, through its use of several imputed data sets, correctly communicates our uncertainty surrounding the imputed values.
On unsanitized data, we saw that the perhaps optimal solution (visually auditing the data) was untenable for moderately sized data sets or larger. We discovered that the grammar of the assertr package provides a mechanism to offload this auditing process to R. You now have a few assertr checking "recipes" under your belt for some of the more common manifestations of the mistakes that plague data that has not been scrutinized.
Chapter 12. Dealing with Large Data
In the previous chapter, we spoke of solutions to common problems that fall under the umbrella term of messy data. In this chapter, we are going to solve some of the problems related to working with large data sets.
Problems with large data sets can occur in R for a few reasons. For one, R (and most other languages, for that matter) was developed during a time when commodity computers had only one processor/core. This means that vanilla R code can't exploit multiple processors/cores, which can offer substantial speed-ups. Another salient reason why R might run into trouble analyzing large data sets is that R requires the data objects it works with to be stored completely in RAM. If your data set exceeds the capacity of your RAM, your analyses will slow down to a crawl.
When one thinks of problems related to analyzing large data sets, one may think of Big Data. One can scarcely be involved (or even interested) in the field of data analysis without hearing about big data. I stay away from that term in this chapter for two reasons: (a) the problems and techniques in this chapter will still be applicable long after the buzzword begins to fade from public memory, and (b) problems related to truly big data are relatively uncommon, and often require specialized tools and know-how that are beyond the scope of this book.
Some have suggested that big data be defined as data that is too big to fit in your computer's memory at one time. Personally, I call this large data, and not just because I have a penchant for splitting hairs! I reserve the term big data for data that is so massive that it requires many hundreds of computers and special consideration in order to be stored and processed.
Sometimes, problems related to high-dimensional data are considered large data problems, too. Unfortunately, solving these problems often requires a background in mathematics beyond the scope of this book, and we will not be discussing high-dimensional statistics.
This chapter is more about optimizing R code to squeeze higher performance out of it so that calculations and analyses with large data sets become computationally tractable. So, perhaps this chapter should more aptly be named High Performance R. Unfortunately, that title is more ostentatious, and wouldn't fit the naming pattern established by the previous chapter.
Each of the top-level sections in this chapter will discuss a specific technique for writing higher performing R code.
Wait to optimize
Prominent computer scientist and mathematician Donald Knuth famously stated:
Premature optimization is the root of all evil.
I, personally, hold that money is the root of all evil, but premature optimization is definitely up there!
Why is premature optimization so evil? Well, there are a few reasons. First, programmers can sometimes be pretty bad at identifying the bottleneck of a program (the routine or routines with the slowest throughput) and end up optimizing the wrong parts of the program. Bottlenecks can most accurately be identified by profiling your code after it's been completed in an un-optimized form.
Secondly, clever tricks and shortcuts for speeding up code often introduce subtle bugs and unexpected behavior. Now, the speedup of the code (if there is any!) must be taken in context with the time it took to complete the bug-finding-and-fixing expedition; occasionally, a net negative amount of time has been saved when all is said and done.
Lastly, since premature optimization literally necessitates writing your code in a way that is different than you normally would, it can have deleterious effects on the readability of the code and your ability to understand it when you look back on it after some period of time. According to Structure and Interpretation of Computer Programs, one of the most famous textbooks in computer science, Programs must be written for people to read, and only incidentally for machines to execute. This reflects the fact that the bulk of the time spent updating or expanding code that is already written goes to a human having to read and understand the code, not to the computer executing it. When you prematurely optimize, you may be causing a huge reduction in readability in exchange for a marginal gain in execution time.
In summary, you should probably wait to optimize your code until you are done, and the performance is demonstrably inadequate.
Using a bigger and faster machine
Instead of rewriting critical sections of your code, consider running the code on a machine with a faster processor, more cores, more RAM, faster bus speeds, and/or reduced disk latency. This suggestion may seem like a glib cop-out, but it's not. Sure, using a bigger machine for your analytics sometimes means extra money, but your time, dear reader, is money too. If, over the course of your work, it takes you many hours to optimize your code adequately, buying or renting a better machine may actually prove to be the more cost-effective solution.
Going down this road needn't require that you purchase a high-powered machine outright; there are now virtual servers that you can rent online for finite periods of time at reasonable prices. Some of these virtual servers can be configured to have 2 terabytes of RAM and 40 virtual processors. If you are interested in learning more about this option, look at the offerings of DigitalOcean, Amazon Elastic Compute Cloud, or many other similar service providers.
Ask your employer or research advisor if this is a feasible option. If you are working for a non-profit with a limited budget, you may be able to work out a deal with a particularly charitable cloud computing service provider. Tell 'em that Tony sent you! But don't actually do that.
Be smart about your code
In many cases, the performance of R code can be greatly improved by simple restructuring of the code; this doesn't change the output of the program, just the way it is represented. Restructurings of this type are often referred to as code refactoring. The refactorings that really make a difference performance-wise usually have to do with either improved allocation of memory or vectorization.
Allocation of memory
Refer all the way back to Chapter 5, Using Data to Reason About the World. Remember when we created a mock population of women's heights in the US, and repeatedly took 10,000 samples of 40 from it to demonstrate the sampling distribution of the sample means? In a code comment, I mentioned in passing that the snippet numeric(10000) created an empty vector of 10,000 elements, but I never explained why we did that. Why didn't we just create a vector of length 1, and continually tack each new sample mean onto the end of it as follows:
set.seed(1)
all.us.women <- rnorm(10000, mean=65, sd=3.5)

means.of.our.samples.bad <- c(1)
# I'm increasing the number of
# samples to 30,000 to prove a point
for(i in 1:30000){
  a.sample <- sample(all.us.women, 40)
  means.of.our.samples.bad[i] <- mean(a.sample)
}
It turns out that R stores vectors in contiguous addresses in your computer's memory. This means that every time a new sample mean gets tacked onto the end of means.of.our.samples.bad, R has to make sure that the next memory block is free. If it is not, R has to find a contiguous section of memory that can fit all the elements, copy the vector over (element by element), and free the memory in the original location. In contrast, when we created an empty vector of the appropriate number of elements, R only had to find a memory location with the requisite number of free contiguous addresses once.
Let's see just what kind of difference this makes in practice. We will use the system.time function to time the execution of both approaches:
means.of.our.samples.bad <- c(1)
system.time(
  for(i in 1:30000){
    a.sample <- sample(all.us.women, 40)
    means.of.our.samples.bad[i] <- mean(a.sample)
  }
)

means.of.our.samples.good <- numeric(30000)
system.time(
  for(i in 1:30000){
    a.sample <- sample(all.us.women, 40)
    means.of.our.samples.good[i] <- mean(a.sample)
  }
)
------------------------------------
   user  system elapsed
  2.024   0.431   2.465
   user  system elapsed
  0.678   0.004   0.684
Although an elapsed time saving of less than two seconds doesn't seem like a big deal, (a) it adds up, and (b) the difference gets more and more dramatic as the number of elements in the vector increases.
By the way, this preallocation business applies to matrices, too.
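A sketch of the same idea for matrices: allocate the full matrix up front rather than growing it row by row with rbind inside the loop.

```r
# bad: growing the matrix one row at a time with rbind
# forces repeated reallocation and copying
grow <- function(n){
  m <- NULL
  for(i in 1:n) m <- rbind(m, rnorm(10))
  m
}

# good: preallocate all the rows once, then fill them in
prealloc <- function(n){
  m <- matrix(NA_real_, nrow=n, ncol=10)
  for(i in 1:n) m[i, ] <- rnorm(10)
  m
}

system.time(grow(5000))      # noticeably slower
system.time(prealloc(5000))  # much faster
```

Both functions return an identical-shaped matrix; only the allocation strategy (and the run time) differs.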
Vectorization
Were you wondering why R is so adamant about keeping the elements of vectors in adjoining memory locations? Well, if R didn't, then traversing a vector (like when you apply a function to each element) would require hunting around the memory space for the right elements in different locations. Having the elements all in a row gives us an enormous advantage, performance-wise.
To fully exploit this vector representation, it helps to use vectorized functions, which we were first introduced to in Chapter 1, RefresheR. These vectorized functions call optimized, blazingly fast C code to operate on vectors instead of the comparatively slower R code. For example, let's say we wanted to square each height in the all.us.women vector. One way would be to use a for loop to square each element as follows:
system.time(
  for(i in 1:length(all.us.women))
    all.us.women[i]^2
)
-------------------------
   user  system elapsed
  0.003   0.000   0.003
Okay, not bad at all. Now what if we applied a lambda squaring function to each element using sapply?
system.time(
  sapply(all.us.women, function(x) x^2)
)
----------------------
   user  system elapsed
  0.006   0.000   0.006
Okay, that's worse. But we can use a function that's like sapply, and which allows us to specify the type of the return value in exchange for faster processing speed:
> system.time(
+   vapply(all.us.women, function(x) x^2, numeric(1))
+ )
------------------------
   user  system elapsed
  0.006   0.000   0.005
Still not great. Finally, what if we just square the entire vector?
system.time(
  all.us.women^2
)
---------------------
   user  system elapsed
      0       0       0
This was so fast that system.time didn't have the resolution to detect any processing time at all. Further, this way of writing the squaring functionality was by far the easiest to read.
The moral of the story is to use vectorized options whenever you can. All of core R's arithmetic operators (+, -, ^, sqrt, log, and so on) are of this type. Additionally, using the rowSums and colSums functions on matrices is faster than apply(A_MATRIX, 1, sum) and apply(A_MATRIX, 2, sum) respectively, for much the same reason.
Speaking of matrices, before we move on, you should know that certain matrix operations are blazingly fast in R, because the routines are implemented in compiled C and/or Fortran code. If you don't believe me, try writing and testing the performance of OLS regression without using matrix multiplication.
If you have the linear algebra know-how, and have the option to rewrite a computation that you need to perform using matrix operations, you should definitely try it out.
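As a sketch of what this looks like, the OLS coefficients can be obtained with a couple of matrix operations by solving the normal equations (X'X)b = X'y; the data here is simulated purely for illustration:

```r
set.seed(1)
x <- rnorm(1000)
y <- 2 + 3*x + rnorm(1000)

# design matrix with an intercept column
X <- cbind(1, x)

# solve the normal equations (X'X) b = X'y in one shot;
# these calls drop down to compiled linear algebra routines
beta <- solve(t(X) %*% X, t(X) %*% y)

# should match lm's estimates
coef(lm(y ~ x))
```

The matrix version and lm agree to within floating-point error, but the raw matrix arithmetic skips all of lm's formula-parsing and bookkeeping overhead.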
Using optimized packages
Many of the functionalities in base R have alternative implementations available in contributed packages. Quite often, these packages offer a faster or less memory-intensive substitute for the base R equivalent. For example, in addition to adding a ton of extra functionality, the glmnet package performs regression far faster than glm, in my experience.
For faster data import, you might be able to use fread from the data.table package or the read_* family of functions from the readr package. It is not uncommon for data import tasks that used to take several hours to take only a few minutes with these read functions.
For common data manipulation tasks, like merging (joining), conditional selection, sorting, and so on, you will find that the data.table and dplyr packages offer incredible speed improvements. Both of these packages have a ton of useRs that swear by them, and the community support is solid. You'd be well advised to become proficient in one of these packages when you're ready.
Note
As it turns out, the sqldf package that I mentioned in passing in Chapter 10, Sources of Data (the one that can perform SQL queries on data frames) can sometimes offer performance improvements for common data manipulation tasks, too. Behind the scenes, sqldf (by default) loads your data frame into a temporary SQLite database, performs the query in the database's SQL execution environment, returns the results from the database in the form of a data frame, and destroys the temporary database. Since the queries run on the database, sqldf can (a) sometimes perform the queries faster than the equivalent native R code, and (b) somewhat relax the constraint that the data objects R uses be held completely in memory.
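A minimal sketch of the kind of query sqldf handles (assuming the sqldf package is installed; the sales data frame is invented for illustration):

```r
library(sqldf)

# a toy data frame standing in for a much larger table
sales <- data.frame(
  region = c("east", "west", "east", "west"),
  amount = c(100, 250, 50, 300))

# the query runs inside a temporary SQLite database,
# and the result comes back as an ordinary data frame
res <- sqldf("SELECT region, SUM(amount) AS total
                FROM sales GROUP BY region")
res
```

The same aggregation could be written with aggregate or dplyr; sqldf is handy when you already think in SQL, or when the query is easier to express that way.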
The constraint that the data objects in R must be able to fit into memory can be a real obstacle for people who work with data sets that are rather large, but just shy of being big enough to necessitate special tools. Some can thwart this constraint by storing their data objects in a database, and only using selected subsets (that will fit in the memory). Others can get by using random samples of the available data instead of requiring the whole data set to be held at once. If none of these options sound appealing, there are packages in R that will allow importing data that is larger than the memory available by directly referring to the data as it's stored on your hard disk. The most popular of these seem to be ff and bigmemory. There is a cost to this, however; not only are the operations slower than they would be if the data were in memory, but since the data is processed piecemeal (in chunks), many standard R functions won't work on them. Be that as it may, the ffbase and biganalytics packages provide methods to restore some of the functionality lost for the two packages respectively. Most notably, these packages allow ff and bigmemory objects to be used in the biglm package, which can build generalized linear models using data that is too big to fit in the memory.
Note
biglm can also be used to build generalized linear models using data stored in a database!
Remember the CRAN Task Views we talked about in the last chapter? There is a whole Task View dedicated to High Performance Computing (https://cran.r-project.org/web/views/HighPerformanceComputing.html). If there is a particular statistical technique that you'd like to find an optimized alternative for, this is the first place I'd check.
Using another R implementation
R is both a language and an implementation of that language. So far, when we've been talking about the R environment/platform, we've been talking about the GNU project started by R. Ihaka and R. Gentleman at the University of Auckland in 1993 and hosted at http://www.r-project.org. Since R has no standard specification, this canonical implementation serves as R's de facto specification. If a project is able to implement this specification (and rewrite GNU R functionality-for-functionality and bug-for-bug), any valid R code can be run on that implementation.
Sometime around 2009, various other implementations of R started to crop up. Among these are Renjin (running on the Java Virtual Machine), pqR (which stands for Pretty Quick R, and is written in a mix of C, R, and Fortran), FastR (which is written in Java), and Riposte (which is written mainly in C++). These alternative implementations promise compelling improvements over GNU R, such as automatic multithreading (parallelization), the ability to handle larger data, and tighter integration with Java.
Unfortunately, none of these projects are complete yet. Because of this, not everything you'd expect has been implemented; some of your favorite packages may stop working, and, by and large, these implementations are difficult to install. For these reasons, I would only recommend this route for very advanced users and/or the extremely desperate.
Although it doesn't qualify as another R implementation, there is another R distribution that is gaining popularity, put out by a commercial enterprise named Revolution Analytics, called Revolution R Enterprise. This distribution boasts automatic parallelization for certain rewritten functions, an improved ability to work on and model data sets that will not fit in RAM (for certain rewritten functions), facilities for distributed computing, and tighter integration with big data databases. This is a paid distribution of R, but you can use it for free if you are a student, or at a discount if you work in the non-profit public service sector.
Revolution Analytics also puts out a free alternative distribution of R called Revolution R Open. The primary benefit of this distribution, from a performance perspective, is the ease with which it can be installed and used with the high performance Intel Math Kernel Library (MKL). The MKL is a drop-in substitute for the linear algebra libraries that are bundled automatically with GNU R. While the linear algebra library that ships with GNU R is single-threaded, the MKL can exploit multiple cores transparently. This makes computations like matrix decomposition, matrix inversion, and vectorized math (very common, whether explicitly used or not) much faster.
Before we go on, it should be noted that you don't have to use Revolution R Open to take advantage of the MKL or any other multi-threaded linear algebra libraries like OpenBLAS, ATLAS, and Accelerate (which comes with OS X and is Mac only); I don't. However, linking GNU R with these other libraries can sometimes get messy and requires care. Interested readers can find instructions on how to do this linking on the web, mostly in the form of blog posts from R enthusiasts.
Note
The Macintosh version of Revolution R Open, by default, integrates with the multi-threaded Accelerate framework instead of MKL.
Use parallelization
As we saw in this chapter's introduction, one of the limitations of R (and most other programming languages) is that it was created before commodity personal computers had more than one processor or core. As a result, by default, R runs in only one process and, thus, makes use of one processor/core at a time.
If you have more than one core on your CPU, it means that when you leave your computer alone for a few hours during a long-running computation, your R task is running on one core while the others are idle. Clearly this is not ideal; if your R task took advantage of all the available processing power, you could get massive speed improvements.
Parallel computation (of the type we'll be using) works by starting multiple processes at the same time. The operating system then assigns each of these processes to a particular CPU. When multiple processes run at the same time, the time to completion is only as long as the longest process, as opposed to the sum of the times of all the processes.
Figure 12.1: Diagram of parallelization and the resultant reduced time to completion
For example, let's say we have a task made up of four processes that each take 1 second to complete. Without parallelization, the task would take 4 seconds, but with parallelization on four cores, the task would take 1 second.
Note
A word of warning: this is the ideal scenario; in practice, the cost of starting multiple processes constitutes an overhead that will result in the time to completion not scaling linearly with the number of cores used.
All this sounds great, but there's an important catch; each process has to be able to run independent of the output of the other processes. For example, if we wrote an R program to compute the nth number in the Fibonacci sequence, we couldn't divide that task up into smaller processes to run in parallel, because the nth Fibonacci number depends on what we compute as the (n-1)th Fibonacci number (and so on, ad infinitum). The parallelization of the type we'll be using in this chapter only works on problems that can be split up into processes, such that the processes don't depend on each other and there's no communication between processes. Luckily, there are many problems like this in data analysis! Almost as luckily, R makes it easy to use parallelization on problems of this type!
Problems of the nature we just described are sometimes known as embarrassingly parallel problems, because the entire task can be broken down into independent components very easily. As an example, summing the numbers in a numeric vector of 100 elements is an embarrassingly parallel problem, because we can easily sum the first 50 elements in one process and the last 50 in another, in parallel, and just add the two numbers at the end to get the final sum. The pattern of computation we just described is sometimes referred to as split-apply-combine, divide and conquer, or map/reduce.
Note
Using parallelization to tackle the problem of summing 100 numbers is silly, since the overhead of the splitting and combining will take longer than it would to just sum up all the 100 elements serially. Also, sum is already really fast and vectorized.
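The splitting and combining for the vector-summing example can be sketched in plain (serial) R; in a parallel version, each partial sum would simply run in its own process:

```r
v <- 1:100

# split: break the vector into two independent halves
chunks <- list(v[1:50], v[51:100])

# apply: sum each chunk; this is the step that could run in
# parallel, since the chunks don't interact at all
partial.sums <- sapply(chunks, sum)

# combine: add the partial results together
total <- sum(partial.sums)
total == sum(v)
```

The split and combine steps are pure overhead; they only pay off when the apply step is expensive enough to dominate them.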
Getting started with parallel R
Getting started with parallelization in R requires minimal setup, but that setup varies from platform to platform. More accurately, the setup is different for Windows than it is for every other operating system that R runs on (GNU/Linux, Mac OS X, Solaris, *BSD, and others).
If you don't have a Windows computer, all you have to do to start is load the parallel package:
# You don't have to install this if your copy of R is new
library(parallel)
If you use Windows, you can either (a) switch to the free operating system that over 97 percent of the 500 most powerful supercomputers in the world use, or (b) run the following setup code:
library(parallel)
cl <- makeCluster(4)
You may replace the 4 with however many processes you want to automatically split your task into. This is usually set to the number of cores available on your computer. You can query your system for the number of available cores with the following incantation:
detectCores()
-----------------------
[1] 4
Our first silly (but demonstrative) application of parallelization is the task of sleeping (making a program become temporarily inactive) for 5 seconds, four different times. We can do this serially (not in parallel) as follows:
for(i in 1:4){
  Sys.sleep(5)
}
Or, equivalently, using lapply:
# lapply will pass each element of the
# vector c(1, 2, 3, 4) to the function
# we write, but we'll ignore it
lapply(1:4, function(i) Sys.sleep(5))
Let's time how long this task takes to complete by wrapping the task inside the argument to the system.time function:
system.time(
  lapply(1:4, function(i) Sys.sleep(5))
)
---------------------------------------
   user  system elapsed
  0.059   0.074  20.005
Unsurprisingly, it took 20 (4*5) seconds to run. Let's see what happens when we run this in parallel:
#######################
# NON-WINDOWS VERSION #
#######################
system.time(
  mclapply(1:4, function(i) Sys.sleep(5), mc.cores=4)
)

###################
# WINDOWS VERSION #
###################
system.time(
  parLapply(cl, 1:4, function(i) Sys.sleep(5))
)
---------------------------------------
   user  system elapsed
  0.021   0.042   5.013
Check that out! 5 seconds! Just what you would expect if four processes were sleeping for 5 seconds at the same time!
For the non-Windows code, we simply use mclapply (the non-Windows parallel counterpart to lapply) instead of lapply, and pass in another argument named mc.cores, which tells mclapply how many processes to automatically split the independent computation into.
For the Windows code, we use parLapply (the Windows parallel counterpart to lapply). The only difference between lapply and parLapply that we've used here is that parLapply takes the cluster we made with the makeCluster setup function as its first argument. Unlike mclapply, there's no need to specify the number of cores to use, since the cluster is already set up with the appropriate number of cores.
Note
Before R got the built-in parallel package, the two main packages that allowed for parallelization were multicore and snow. multicore used a method of creating different processes called forking that was supported on all R-running OSs except Windows. Windows users used the more general snow package to achieve parallelization. snow, which stands for Simple Network of Workstations, not only works on non-Windows computers, but also on a cluster of different computers with identical R installations. multicore did not support cluster computing across physical machines like snow does.
Since R version 2.14, the functionality of both the multicore and snow packages has essentially been merged into the parallel package. The multicore package has since been removed from CRAN.
From now on, when we refer to the Windows counterpart to X, know that we really mean the snow counterpart to X, because the functions of snow will also work on non-Windows OSs and on clusters of machines. Similarly, by the non-Windows counterparts, we really mean the counterparts cannibalized from the multicore package.
You might ask, Why don't we just always use the snow functions? If you have the option to use multicore/forking parallelism (you are running processes on just one non-Windows physical machine), the multicore parallelism tends to be more lightweight. For example, sometimes the creation of a snow cluster with makeCluster can set off firewall alerts. It is safe to allow these connections, by the way.
An example of (some) substance
For our first real application of parallelization, we will be solving a problem that is loosely based on a real problem that I had to solve during the course of my work. In this formulation, we will be importing an open data set from the web that contains the airport code, latitude coordinates, and longitude coordinates for 13,429 US airports. Our task will be to find the average (mean) distance from every airport to every other airport. For example, if LAX, ALB, OLM, and JFK were the only extant airports, we would calculate the distances between JFK and OLM, JFK and ALB, JFK and LAX, OLM and ALB, OLM and LAX, and ALB and LAX, and take the arithmetic mean of these distances.
Why are we doing this? Besides the fact that it was inspired by an actual, real-life problem (and that I covered this very problem in no fewer than three blog posts), this problem is perfect for parallelization for two reasons:
It is embarrassingly parallel: This problem is very amenable to splitting-applying-and-combining (or map/reduction); each process can take a few (several hundred, really) of the airport-to-airport combinations, and the results can then be summed and divided by the number of distance calculations performed.
It exhibits combinatorial explosion: The term combinatorial explosion refers to problems that grow very quickly in size or complexity due to the role of combinatorics in the problem's solution. For example, the number of distance calculations we have to perform exhibits polynomial growth as a function of the number of airports we use. In particular, the number of different calculations is given by the binomial coefficient C(n, 2), or n(n-1)/2. 100 airports require 4,950 distance calculations; all 13,429 airports require 90,162,306 distance calculations. Problems of this type usually require techniques like those discussed in this chapter in order to be computationally tractable.
Note
The birthday problem: Most people are unfazed by the fact that it takes a room of 367 to guarantee that two people in the room have the same birthday. Many people are surprised, however, when it is revealed that it only requires a room of 23 people for there to be a 50 percent chance of two people sharing the same birthday (assuming that birthdays occur on each day with equal probability). Further, it only takes a room of 60 for there to be over a 99 percent chance that a pair will share a birthday. If this surprises you too, consider that the number of pairs of people that could possibly share a birthday grows polynomially with the number of people in the room. In fact, the number of pairs that can share a birthday grows just like our airport problem; the number of birthday pairs is exactly the number of distance calculations we would have to perform if the people were airports.
First, let's write the function to compute the distance between two latitude/longitude pairs. Since the Earth isn't flat (strictly speaking, it's not even a perfect sphere), the distance covered by a degree of latitude or longitude is not constant; meaning, you can't just take the Euclidean distance between the two points. We will be using the Haversine formula for the distance between the two points. The Haversine formula is given as follows:

  a = sin²(Δφ/2) + cos(φ1) · cos(φ2) · sin²(Δλ/2)
  d = 2r · atan2(√a, √(1−a))

where φ and λ are the latitude and longitude respectively, r is the Earth's radius, and Δ is the difference between the two latitudes or longitudes.
haversine <- function(lat1, long1, lat2, long2, unit="km"){
  radius <- 6378      # radius of Earth in kilometers
  delta.phi <- to.radians(lat2 - lat1)
  delta.lambda <- to.radians(long2 - long1)
  phi1 <- to.radians(lat1)
  phi2 <- to.radians(lat2)
  term1 <- sin(delta.phi/2) ^ 2
  term2 <- cos(phi1) * cos(phi2) * sin(delta.lambda/2) ^ 2
  the.terms <- term1 + term2
  delta.sigma <- 2 * atan2(sqrt(the.terms), sqrt(1 - the.terms))
  distance <- radius * delta.sigma
  if(unit=="km") return(distance)
  if(unit=="miles") return(0.621371*distance)
}
Everything must be measured in radians (not degrees), so let's make a helper function for conversion to radians, too:
to.radians <- function(degrees){
  degrees * pi / 180
}
Now let's load the data set from the web. Since it's from an outside source and it might be messy, this is an excellent chance to use our assertr chops to make sure the foreign data set matches our expectations: the data set is 13,429 observations long, it has three named columns, the latitude should be 90 or below, and the longitude should be 180 or below. We'll also just start with a subset of all the airports. Because we are going to be taking a random sample of all the observations, we'll set the random number generator seed so that my calculations will align with yours, dear reader.
set.seed(1)

the.url <- "http://opendata.socrata.com/api/views/rxrh-4cxm/rows.csv?accessType=DOWNLOAD"
all.airport.locs <- read.csv(the.url, stringsAsFactors=FALSE)

library(magrittr)
library(assertr)
CHECKS <- . %>%
  verify(nrow(.) == 13429) %>%
  verify(names(.) %in% c("locationID", "Latitude", "Longitude")) %>%
  assert(within_bounds(0, 90), Latitude) %>%
  assert(within_bounds(0, 180), Longitude)

all.airport.locs <- CHECKS(all.airport.locs)

# Let's start off with 400 airports
smp.size <- 400

# choose a random sample of airports
random.sample <- sample((1:nrow(all.airport.locs)), smp.size)
airport.locs <- all.airport.locs[random.sample, ]
row.names(airport.locs) <- NULL

head(airport.locs)
------------------------------------
  locationID Latitude Longitude
1        LWV  38.7642   87.6056
2       LS77  30.7272   91.1486
3        2N2  43.5919   71.7514
4       VG00  37.3697   75.9469
Now let's write a function called single.core that computes the average distance between every pair of airports, not using any parallel computation. For each lat/long pair, we need to find the distance between it and the rest of the lat/long pairs. Since the distance between point a and point b is the same as the distance between b and a, for every row, we need only compute the distance between it and the remaining rows in the airport.locs data frame:
single.core <- function(airport.locs){
  running.sum <- 0
  for(i in 1:(nrow(airport.locs)-1)){
    for(j in (i+1):nrow(airport.locs)){
      # i is the row of the first lat/long pair
      # j is the row of the second lat/long pair
      this.dist <- haversine(airport.locs[i, 2],
                             airport.locs[i, 3],
                             airport.locs[j, 2],
                             airport.locs[j, 3])
      running.sum <- running.sum + this.dist
    }
  }
  # Now we have to divide by the number of
  # distances we took. This is given by n(n-1)/2
  return(running.sum /
         ((nrow(airport.locs)*(nrow(airport.locs)-1))/2))
}
Now, let's time it!
system.time(ave.dist <- single.core(airport.locs))
print(ave.dist)
----------------------------
   user  system elapsed
  5.400   0.034   5.466
[1] 1667.186
All right, 5 and a half seconds for 400 airports.
In order to use the parallel surrogates for lapply, let's rewrite the function to use lapply. Observe the output of the following incantation:
# We'll have to limit the output to the
# first 11 columns
combn(1:10, 2)[, 1:11]
---------------------------------------
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]    1    1    1    1    1    1    1    1    1     2     2
[2,]    2    3    4    5    6    7    8    9   10     3     4
The preceding code used the combn function to create a matrix that contains all pairs of two numbers from 1 to 10, stored as columns in two rows. If we use the combn function with a vector of integers from 1 to n (where n is the number of airports in our data frame), each column of the resultant matrix will refer to all the different indices with which to index the airport data frame in order to obtain all the possible pairs of airports. For example, let's go back to the world where LAX, ALB, OLM, and JFK were the only extant airports; consider the following:
small.world <- c("LAX", "ALB", "OLM", "JFK")
all.combs <- combn(1:length(small.world), 2)
for(i in 1:ncol(all.combs)){
  from <- small.world[all.combs[1, i]]
  to <- small.world[all.combs[2, i]]
  print(paste(from, "<->", to))
}
---------------------------------------
[1] "LAX <-> ALB"
[1] "LAX <-> OLM"
[1] "LAX <-> JFK"
[1] "ALB <-> OLM"      # back to Olympia
[1] "ALB <-> JFK"
[1] "OLM <-> JFK"
Formulating our solution around this matrix of indices, we can use lapply to loop over the columns in the matrix:
small.world <- c("LAX", "ALB", "OLM", "JFK")
all.combs <- combn(1:length(small.world), 2)
# instead of printing each airport pair in a string,
# we'll return the string
results <- lapply(1:ncol(all.combs), function(x){
  from <- small.world[all.combs[1, x]]
  to <- small.world[all.combs[2, x]]
  return(paste(from, "<->", to))
})
print(results)
------------------------
[[1]]
[1] "LAX <-> ALB"

[[2]]
[1] "LAX <-> OLM"

[[3]]
[1] "LAX <-> JFK"
........
In our problem, we will be returning numerics from the anonymous function in lapply. However, because we are using lapply, the results will be a list. Because we can't call sum on a list of numerics, we will use the unlist function to turn the list into a vector:
unlist(results)
--------------------
[1] "LAX <-> ALB" "LAX <-> OLM" "LAX <-> JFK"
[4] "ALB <-> OLM" "ALB <-> JFK" "OLM <-> JFK"
We have everything we need to rewrite the single.core function using lapply:
single.core.lapply <- function(airport.locs){
  all.combs <- combn(1:nrow(airport.locs), 2)
  numcombs <- ncol(all.combs)
  results <- lapply(1:numcombs, function(x){
    lat1  <- airport.locs[all.combs[1, x], 2]
    long1 <- airport.locs[all.combs[1, x], 3]
    lat2  <- airport.locs[all.combs[2, x], 2]
    long2 <- airport.locs[all.combs[2, x], 3]
    return(haversine(lat1, long1, lat2, long2))
  })
  return(sum(unlist(results)) / numcombs)
}
system.time(ave.dist <- single.core.lapply(airport.locs))
print(ave.dist)
--------------------------------------
   user  system elapsed
  5.890   0.042   5.968
[1] 1667.186
This particular solution is a little bit slower than our solution with the double for loop, but it's about to pay enormous dividends; now we can use one of the parallel surrogates for lapply to solve the problem:
#######################
#NON-WINDOWSVERSION#
#######################
multi.core<-function(airport.locs){
all.combs<-combn(1:nrow(airport.locs),2)
numcombs<-ncol(all.combs)
results<-mclapply(1:numcombs,function(x){
lat1<-airport.locs[all.combs[1,x],2]
long1<-airport.locs[all.combs[1,x],3]
lat2<-airport.locs[all.combs[2,x],2]
long2<-airport.locs[all.combs[2,x],3]
return(haversine(lat1,long1,lat2,long2))
},mc.cores=4)
return(sum(unlist(results))/numcombs)
}
###################
#WINDOWSVERSION#
###################
clusterExport(cl,c("haversine","to.radians"))
multi.core<-function(airport.locs){
all.combs<-combn(1:nrow(airport.locs),2)
numcombs<-ncol(all.combs)
results<-parLapply(cl,1:numcombs,function(x){
lat1<-airport.locs[all.combs[1,x],2]
long1<-airport.locs[all.combs[1,x],3]
lat2<-airport.locs[all.combs[2,x],2]
long2<-airport.locs[all.combs[2,x],3]
return(haversine(lat1,long1,lat2,long2))
})
return(sum(unlist(results))/numcombs)
}
system.time(ave.dist<-multi.core(airport.locs))
print(ave.dist)
------------------------------usersystemelapsed
7.3630.2402.743
[1]1667.186
Before we interpret the output, direct your attention to the first line of the Windows segment. When mclapply creates additional processes, these processes share the memory with the parent process, and have access to all the parent's environment. With parLapply, however, the procedure that spawns new processes is a little different and requires that we manually export all the functions and libraries we need to load onto each new process beforehand. In this example, we need the new workers to have the haversine and to.radians functions.
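For completeness, here is a minimal sketch of the cluster setup that the Windows version assumes has already happened: the cluster object cl must be created (and eventually shut down) by you. The two-worker count and the toy helper function below are assumptions for illustration, not code from the chapter:

```r
library(parallel)

# stand-in for one of the chapter's helpers, so this sketch runs on its own
to.radians <- function(deg) deg * pi / 180

cl <- makeCluster(2)                # spawn two worker processes
clusterExport(cl, c("to.radians"))  # ship the helper to every worker
res <- parLapply(cl, list(90, 180), function(x) to.radians(x))
stopCluster(cl)                     # always release the workers when done
```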
Now to the output of the last code snippet. On my Macintosh machine with four cores, this brings what once was a 5.5 second affair down to a 2.7 second affair. This may not seem like a big deal, but when we expand and start to include more than just 400 airports, we start to see the multicore version really pay off.

To demonstrate just what we've gained from our hassles in parallelizing the problem, I ran this on a GNU/Linux cloud server with 16 cores, and recorded the time it took to complete the calculations for different sample sizes with 1, 2, 4, 8, and 16 cores. The results are depicted in the following image:

Figure 12.2: The running times for the average-distance-between-all-airports task at different sample sizes for 1, 2, 4, 8, and 16 cores. For reference, the dashed line is the 4 core performance curve, the topmost curve is the single core performance curve, and the bottommost curve is the 16 core curve.

It may be hard to tell from the plot, but the estimated times to completion for the task running on 1, 2, 4, 8, and 16 cores are 2.4 hours, 1.2 hours, 36 minutes, 19 minutes, and 17 minutes respectively. Using parallelized R on a 4-core machine—which is not an uncommon setup at the time of writing—has been able to shave a full two hours off the task's running time! Note the diminishing marginal returns on the number of cores used; there is barely any difference between the performances of the 8 and 16 cores. C'est la vie.
Using Rcpp

Contrary to what I sometimes like to believe, there are other computer programming languages than just R. R—and languages like Python, Perl, and Ruby—are considered high-level languages, because they offer a greater level of abstraction from computer representations and resource management than the lower-level languages. For example, in some lower-level languages, you must specify the data type of the variables you create and manage the allocation of RAM manually—C, C++, and Fortran are of this type.

The high level of abstraction R provides allows us to do amazing things very quickly—like import a dataset, run a linear model, and plot the data and regression line in no more than 4 lines of code! On the other hand, nothing quite beats the performance of carefully crafted lower-level code. Even so, it would take hundreds of lines of code to run a linear model in a low-level language, so a language like that is inappropriate for agile analytics.

One solution is to use R abstractions when we can, and be able to get down to lower-level programming where it can really make a large difference. There are a few paths for connecting R and lower-level languages, but the easiest way by far is to combine R and C++ with Rcpp.

Note

There are differences in what is considered high-level. For this reason, you will sometimes see people and texts (mostly older texts) refer to C and C++ as high-level languages. The same people may consider R, Python, and so on as very high-level languages. Therefore, the level of a language is somewhat relative.

A word of warning before we go on: This is an advanced topic, and this section will (out of necessity) gloss over some (most) of the finer details of C++ and Rcpp. If you're wondering whether a detailed reading will pay off, it's worth taking a peek at the conclusion of this section to see how many seconds it took to complete the average-distance-between-all-airports task that would have taken over 2 hours to complete unoptimized.

If you decide to continue, you must install a C++ compiler. On GNU/Linux this is usually done through the system's package manager. On Mac OS X, Xcode must be installed; it is available free in the App Store. For Windows, you must install the Rtools available at http://cran.r-project.org/bin/windows/Rtools/. Finally, all users need to install the Rcpp package. For more information, consult sections 1.2 and 1.3 of the Rcpp FAQ (http://dirk.eddelbuettel.com/code/rcpp/Rcpp-FAQ.pdf).
Essentially, our integration of R and C++ is going to take the form of us rewriting certain functions in C++, and calling them in R. Rcpp makes this very easy; before we discuss how to write C++ code, let's look at an example. Put the following code into a file, and name it our_cpp_functions.cpp:

#include <Rcpp.h>

// [[Rcpp::export]]
double square(double number){
  return(pow(number, 2));
}

Congratulations, you've just written a C++ program! Now, from R, we'll read the C++ file, and make the function available to R. Then, we'll test out our new function.

library(Rcpp)
sourceCpp("our_cpp_functions.cpp")
square(3)
-------------------------------
[1] 9
The first two lines with text have nothing to do with our function, per se. The first line is necessary for C++ to integrate with R. The second line (// [[Rcpp::export]]) tells R that we want the function directly below it to be available for use (exported) within R. Functions that aren't exported can only be used in the C++ file, internally.

Note

The // is a comment in C++, and it works just like # in R. C++ also has another type of comment that can span multiple lines. These multiline comments start with /* and end with */.
Throughout this section, we'll be adding functions to our_cpp_functions.cpp and re-sourcing the file from R to use the new C++ functions.

The following modest square function can teach us a lot about the differences between the C++ code and R code. For example, the preceding C++ function is roughly equivalent to the following in R:

square <- function(number){
  return(number^2)
}

The two doubles denote that the return value and the argument, respectively, are both of data type double. double stands for double precision floating point number, which is roughly equivalent to R's more general numeric data type.

The second thing to notice is that we raise numbers to powers using the pow function, instead of using the ^ operator, like in R. This is a minor syntactical difference. The third thing to note is that each statement in C++ ends with a semicolon.
Believe it or not, we now have enough knowledge to rewrite the to.radians function in C++.

/* Add this (and all other snippets that
   start with "// [[Rcpp::export]]")
   to the C++ file, not the R code. */
// [[Rcpp::export]]
double to_radians_cpp(double degrees){
  return(degrees * 3.141593 / 180);
}

# this goes with our R code
sourceCpp("our_cpp_functions.cpp")
to_radians_cpp(10)
------------------------
[1] 0.174533
Incredibly, with the help of some search-engine-fu or a good C++ reference, we can rewrite the whole haversine function in C++ as follows:

// [[Rcpp::export]]
double haversine_cpp(double lat1, double long1,
                     double lat2, double long2,
                     std::string unit="km"){
  int radius = 6378;
  double delta_phi = to_radians_cpp(lat2 - lat1);
  double delta_lambda = to_radians_cpp(long2 - long1);
  double phi1 = to_radians_cpp(lat1);
  double phi2 = to_radians_cpp(lat2);
  double term1 = pow(sin(delta_phi / 2), 2);
  double term2 = cos(phi1) * cos(phi2);
  term2 = term2 * pow(sin(delta_lambda / 2), 2);
  double the_terms = term1 + term2;
  double delta_sigma = 2 * atan2(sqrt(the_terms),
                                 sqrt(1 - the_terms));
  double distance = radius * delta_sigma;
  /* if it is anything *but* km it is miles */
  if(unit != "km"){
    return(distance * 0.621371);
  }
  return(distance);
}
Now, let's re-source it, and test it...

sourceCpp("our_cpp_functions.cpp")
haversine(51.88, 176.65, 56.94, 154.18)
haversine_cpp(51.88, 176.65, 56.94, 154.18)
---------------------------------------------
[1] 1552.079
[1] 1552.079

Are you surprised to see that the R and the C++ are so similar? The only things that are unfamiliar in this new function are the following:

the int data type (which just holds an integer)
the std::string data type (which holds a string, or a character vector, in R parlance)
the if statement (which is identical to R's)

Other than those things, this is just building upon what we've already learned with the first function.
Our last matter of business is to rewrite the single.core function in C++. To build up to that, let's first write a C++ function called sum2 that takes a numeric vector and returns the sum of all the numbers:

// [[Rcpp::export]]
double sum2(Rcpp::NumericVector a_vector){
  double running_sum = 0;
  int length = a_vector.size();
  for(int i = 0; i < length; i++){
    running_sum = running_sum + a_vector(i);
  }
  return(running_sum);
}
There are a few new things in this function:

We have to specify the data type of all the variables (including function arguments) in C++, but what's the data type of the R vector that we're to pass into sum2? The include statement at the top of the C++ file allows us to use the Rcpp::NumericVector data type (which does not exist in standard C++).

To get the length of a NumericVector (like we would in R with the length function), we use the .size() method.

The C++ for loop is a little different than its R counterpart. To wit, it takes three fields, separated by semicolons; the first field initializes a counter variable, the second field specifies the conditions under which the for loop will continue (we'll stop iterating when our counter index is the length of the vector), and the third is how we update the counter from iteration to iteration (i++ means add 1 to i). All in all, this for loop is equivalent to a for loop in R that starts with for(i in 1:length).

The way to subscript a vector in C++ is by using parentheses, not brackets. We will also be using parentheses when we start subscripting matrices.

At every iteration, we use the counter as an index into the NumericVector, and extract the current element; we update the running sum with the current element, and when the loop ends, we return the running sum.

Please note before we go on that the first element of any vector in C++ is the 0th element, not the first. For example, the third element of a vector called victor is victor[3] in R, whereas it would be victor(2) in C++. This is why the second field of the for loop is i < length and not i <= length.
Now, we're finally ready to rewrite the single.core function from the last section in C++!

// [[Rcpp::export]]
double single_core_cpp(Rcpp::NumericMatrix mat){
  int nrows = mat.nrow();
  int numcomps = nrows*(nrows-1)/2;
  double running_sum = 0;
  for(int i = 0; i < nrows; i++){
    for(int j = i+1; j < nrows; j++){
      double this_dist = haversine_cpp(mat(i,0), mat(i,1),
                                       mat(j,0), mat(j,1));
      running_sum = running_sum + this_dist;
    }
  }
  return running_sum / numcomps;
}

Nothing here should be too new. The only two new components are that we are taking a new data type, an Rcpp::NumericMatrix, as an argument, and that we are using .nrow() to get the number of rows in a matrix.
Let's try it out! When we used the R function single.core, we called it with the whole airport data.frame as an argument. But since the C++ function takes a matrix of latitude/longitude pairs, we will simply drop the first column (holding the airport name) from the airport.locs data frame, and convert what's left into a matrix.

sourceCpp("our_cpp_functions.cpp")
the.matrix <- as.matrix(airport.locs[,-1])
system.time(ave.dist <- single_core_cpp(the.matrix))
print(ave.dist)
---------------------------------------
   user  system elapsed
  0.012   0.000   0.012
[1] 1667.186

Okay, the task that used to take 5.5 seconds now takes less than one tenth of a second (and the outputs match, to boot!) Astoundingly, we can perform the task on all the 13,429 airports quite easily now:

the.matrix <- as.matrix(all.airport.locs[,-1])
system.time(ave.dist <- single_core_cpp(the.matrix))
print(ave.dist)
------------------------------
   user  system elapsed
 12.310   0.080  12.505
[1] 1869.744

Using Rcpp, it takes a mere 12.5 seconds to calculate and average 90,162,306 distances—a feat that would have taken even a 16 core server 17 minutes to complete.
Be smarter about your code

In a blog post that I penned showcasing the performance of this task under various optimization methods, I took it for granted that calculating the distances on the full dataset with the unparallelized/un-Rcpp-ed code would be a multi-hour affair—but I was seriously mistaken.

Shortly after publishing the post, a clever R programmer commented on it stating that they were able to slightly rework the code so that the serial/pure-R code took less than 20 seconds to complete with all the 13,429 observations. How? Vectorization.

single.core.improved <- function(airport.locs){
  numrows <- nrow(airport.locs)
  running.sum <- 0
  for(i in 1:(numrows-1)){
    this.dist <- sum(haversine(airport.locs[i, 2],
                               airport.locs[i, 3],
                               airport.locs[(i+1):numrows, 2],
                               airport.locs[(i+1):numrows, 3]))
    running.sum <- running.sum + this.dist
  }
  return(running.sum / (numrows*(numrows-1)/2))
}

system.time(ave.dist <- single.core.improved(all.airport.locs))
print(ave.dist)
-----------------------------------------------------------------
   user  system elapsed
 15.537   0.173  15.866
[1] 1869.744
Not even 16 seconds. It's worth following what this code is doing.

There is only one for loop that is making its rounds down the number of rows in the airport.locs data frame. On each iteration of the for loop, it calls the haversine function just once. The first two arguments are the latitude and longitude of the row that the loop is on. The third and fourth arguments, however, are the vectors of the latitudes and longitudes below the current row. This returns a vector of all the distances from the current airport to the airports below it in the dataset. Since the haversine function could just as easily take vectors instead of single numbers, there is no need for a second for loop.
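The same trick can be seen with a toy function; the function below is purely illustrative and is not part of the chapter's airport code:

```r
# toy illustration of vectorization: because R's arithmetic operators
# work element-wise, one call can compute many "distances" at once
toy.dist <- function(x1, x2) abs(x2 - x1)

toy.dist(0, 5)             # a single distance: 5
toy.dist(0, c(3, 7, 10))   # three distances from one call: 3 7 10
```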
So the haversine function was already vectorized, I just didn't realize it. You'd think that this would be embarrassing for someone who professes to know enough about R to write a book about it. Perhaps it should be. But I found out that one of the best ways to learn—especially about code optimization—is through experimentation and making mistakes. For example, when I started learning about writing high performance R code for both fun and profit, I made quite a few mistakes. One of my first blunders/failed experiments was with this very task; when I first learned about Rcpp, I used it to translate the to.radians and haversine functions only. Having the loop remain in R proved to only give a slight performance edge—nothing compared to the 12.5 second business we've achieved together. Now I know that the bulk of the performance degradation was due to the millions of function calls to haversine—not the actual computation in the haversine function. You could learn that and other lessons most effectively by continuing to try and messing up on your own.

The moral of the story: when you think you've vectorized your code enough, find someone smarter than you to tell you that you're wrong.
Exercises

Practice the following exercises to revise the concepts learned so far:

Is multiple imputation amenable to parallel computation? Why or why not?

How is the way we call to.radians wasteful? Is there any way to refactor our code to use to.radians in a more efficient way?

When I was gathering the data for Figure 12.2, I didn't check every sample size from 1 to the full dataset; yet, I've obtained a smooth curve. What I did was test the performance of a handful of sample sizes from 100 to only 2,000. Then I used nls (non-linear least squares) to fit an equation of the form (where n is the sample size) to the data points, and extrapolated with this equation after solving for x. What are some benefits and drawbacks of this approach? Do this on your own machine, if applicable. Do your performance curves match mine?

There is a thought among some scholars that there is an incongruence between Adam Smith's two seminal works, The Wealth of Nations and The Theory of Moral Sentiments, namely that the preoccupation with self-interest of the former is at odds with the stress placed on the role of what Smith referred to as sympathy (caring for the well-being of others) in guiding moral judgments in the latter. Why are these scholars wrong?
Summary

We began this chapter by explaining some of the reasons why large datasets sometimes present a problem for unoptimized R code, such as no auto-parallelization and no native support for out-of-memory data. For the rest of the chapter we discussed specific routes to optimizing R code in order to tackle large data.

First, you learned of the dangers of optimizing code too early. Next, we saw—much to the relief of slackers everywhere—that taking the lazy way out (and buying or renting a more powerful machine) is often the more cost-effective solution.

After that, we saw that a little knowledge about the dynamics of memory allocation and vectorization in R can often go a long way in performance gains.

The next two sections focused less on changing our R code and more on changing how we use our code. Specifically, we discovered that there are often performance gains to be had by just changing the packages we use and/or our implementation of the R language.

In another section, you learned how parallelization works and what "embarrassingly parallel" problems are. Then we restructured the code solving a real-world problem to employ parallelization. You learned how to do this for both Windows and non-Windows systems, and saw the performance gains you might expect to see when you parallelize embarrassingly parallel problems.

After that, we solved the same example from the last section using Rcpp and saw that:

Connecting R and C++ doesn't have to be as scary as it sounds
The performance often blows all other alternatives out of the water

We concluded with a parable that suggests that learning how to write performant R code is a journey and an art rather than a topic that can be mastered at once.
Chapter 13. Reproducibility and Best Practices

At the close of some programming texts, the reader, now knowing the intricacies of the subject of the text, is nevertheless bewildered on how to actually get started with some serious programming. Very often, discussion of the tooling, environment, and the like—the things that inveterate programmers of language x take for granted—are left for the reader to figure out on their own.

Take R, for example—when you click on the R icon on your system, a rather Spartan window with a text-based interface appears, imploring you to enter commands interactively. Are you to program R in this manner? By typing commands one-at-a-time into this window? This was, more or less, permissible up until this point in the book, but it just won't cut it when you're out there on your own. For any kind of serious work—requiring the rerunning of analyses with modifications, and so on—you need knowledge of the tools and typical workflows that professional R programmers use.

To not leave you in this unenviable position of not knowing how to get started, dear reader, we will be going through a whole chapter's worth of information on typical workflows and common/best practices.

You may have also noticed (via the enormous text at the top of this page) that the subject discussed in the previous paragraphs is sharing the spotlight with reproducibility. What's this, then?

Reproducibility is the ability for you, or an independent party, to repeat a study, experiment, or line of inquiry. This implies the possession of all the relevant and necessary materials and information. It is one of the principal tenets of scientific inquiry. If a study is not replicable, it is simply not science.

If you are a scientist, you are likely already aware of the virtues of reproducibility (if not, shame on you!). If you're a non-scientist data analyst, there is great merit in your taking reproducibility seriously, too. For one, starting an analysis with reproducibility in mind requires a level of organization that makes your job a whole lot easier, in the medium and long run. Secondly, the person who is likely going to be reproducing your analyses the most is you; do yourself a favor, and take reproducibility seriously so that when you need to make changes to an analysis, alter your priors, update your data source, adjust your plots and figures, or roll back to an established checkpoint, you make things easier on yourself. Lastly—and true to the intended spirit of reproducibility—it makes for more reliable and trustworthy dissemination of information.

By the way, all these benefits still hold even if you are working for a private (or otherwise confidential) enterprise, where the analyses are not to be repeated or known about outside of the institution. The ability of your coworkers to follow the narrative of your analysis is invaluable, and can give your firm a competitive edge. Additionally, the ability for supervisors to track and audit your progress is helpful—if you're honest. Finally, keeping your analyses reproducible will make your coworkers' lives much easier when you finally drop everything to go live on the high seas.

Anyway, we are talking about best practices and reproducibility in the same chapter because of the intimate relationship between the two goals. More explicitly, it is best practice for your code to be as reproducible as possible.

Both reproducibility and best practices are wide and diverse topics, but the information in this chapter should give you a great starting point.
R Scripting

The absolute first thing you should know about standard R workflows is that programs are not generally written directly at the interactive R interpreter. Instead, R programs are usually written in a text file (with a .r or .R file extension). These are usually referred to as R scripts. When these scripts are completed, the commands in this text file are usually executed all at once (we'll get to see how, soon). During development of the script, however, the programmer usually executes portions of the script interactively to get feedback and confirm proper behavior. This interactive component to R scripting allows for building each command or function iteratively.

I've known some serious R programmers who copy and paste from their favorite text editor into an interactive R session to achieve this effect. To most people, particularly beginners, the better solution is to use an editor that can send R code from the script that is actively being written to an interactive R console, line-by-line (or block-by-block). This provides a convenient mechanism to run code, get feedback, and tweak code (if need be) without having to constantly switch windows.

If you're a user of the venerable Vim editor, you may find that the Vim-R-plugin achieves this nicely. If you use the equally revered Emacs editor, you may find that Emacs Speaks Statistics (ESS) accomplishes this goal. If you don't have any compelling reason not to, though, I strongly suggest you use RStudio to fill this need. RStudio is a powerful, free Integrated Development Environment (IDE) for R. Not only does RStudio give you the ability to send blocks of code to be evaluated by the R interpreter as you write your scripts, but it also provides all the affordances you'd expect from the most advanced of IDEs, such as syntax highlighting, an interactive debugger, code completion, integrated help and documentation, and project management. It also provides some very helpful R-specific functionality, like a mechanism for visualizing a data frame in memory as a spreadsheet and an integrated plot window. Lastly, it is very widely used within the R community, so there is an enormous amount of help and support available.

Given that RStudio is so helpful, some of the remainder of the chapter will assume you are using it.
RStudio

First things first—go to http://www.rstudio.com, and navigate to the downloads page. Download and install the Open Source Edition of the RStudio Desktop application.

When you first open RStudio, you may only see three panes (as opposed to the four-paned window in Figure 13.1). If this is the case, click the button labeled e in Figure 13.1, and click R Script from the dropdown. Now the RStudio window should look a lot like the one from Figure 13.1.

The first thing you should know about the interface is that all of the panels serve more than one function. The panel labeled a is the source code editor. This will be the pane wherein you edit your R scripts. This will also serve as the editor panel for LaTeX, C++, or R Markdown, if you are writing these kinds of files. You can work on multiple files at the same time, using tabs to switch from document to document. Panel a will also serve as a data viewer that will allow you to view datasets loaded in memory in a spreadsheet-like manner.

Panel b is the interactive R console, which is functionally equivalent to the interactive R console that shipped with R from CRAN. This pane will also display other helpful information or the output of various goings-on in secondary or tertiary tabs.

Panel c allows you to see the objects that you have defined in your global environment. For example, if you load a dataset from disk or the web, the name of the dataset will appear in this panel; if you click on it, RStudio will open the dataset in the data viewer in panel a. This panel also has a tab labeled History, which you can use to view R statements we've executed in the past.

Panel d is the most versatile one; depending on which of its tabs are open, it can be a file explorer, a plot-displayer, an R package manager, and a help browser.

Figure 13.1: RStudio's four-panel interface in Mac OS X (version 0.99.486)

The typical R script development workflow is as follows: R statements, expressions, and functions are typed into the editor in panel a; statements from the editor are executed in the console in panel b by putting the cursor on a chosen line and clicking the Run button (component g from the figure), or by selecting multiple lines and then clicking the Run button. If the outputs of any of these statements are plots, panel d will automatically display these. The script is named and saved when the script is complete (or, preferably, many times while you are writing it).

To learn your way around the RStudio interface, write an R script called nothing.R with the following content:

library(ggplot2)

nothing <- data.frame(a=rbinom(1000, 20, .5),
                      b=c("red", "white"),
                      c=rnorm(1000, mean=100, sd=10))

qplot(c, data=nothing, geom="histogram")
write.csv(nothing, "nothing.csv")

Execute the statements one by one. Notice that the histogram is automatically displayed in panel d. After you are done, type and execute ?rbinom in the interactive console. Notice how panel d displays the help page for this function? Finally, click on the object labeled nothing in panel c, and inspect the dataset in the data viewer.
Running R scripts

There are a few ways to run saved R scripts, like nothing.R. First—and this is RStudio specific—is to click the button labeled Source (component h). This is roughly equivalent to highlighting the entire document and clicking Run.

Of course, we would like to run R scripts without being dependent on RStudio. One way to do this is to use the source function in the interactive R console—either RStudio's console, the console that ships with R from CRAN, or your operating system's command prompt running R. The source function takes a filename as its first and only required argument. The file specified will be executed, and when it's done, it will return you to the prompt with all the objects from the R script now in your workspace. Try this with nothing.R; executing the ls() command after the source function ends should indicate that the nothing data frame is now in your workspace. Calling the source() function is what happens under the hood when you press the Source button in RStudio. If you have trouble making this work, make sure that either (a) you specify the full path to the file nothing.R in the source() function call, or (b) you use setwd() to make the directory containing nothing.R your current working directory, before you execute source("nothing.R").

A third, less popular method is to use the R CMD BATCH command on your operating system's command/terminal prompt. This should work on all systems, out of the box, except Windows, which may require you to add the R binary folder (usually, something like: C:\Program Files\R\R-3.2.1\bin) to your PATH variable. There are instructions on how to accomplish this on the web.

Note

Your system's command prompt (or terminal emulator) will depend on which operating system you use. Windows users' command prompt is called cmd.exe (which you can run by pressing Windows-key + R, typing cmd, and striking enter). Macintosh users' terminal emulator is known as Terminal.app, and is under /Applications/Utilities. If you use GNU/Linux or BSD, you know where the terminal is.

Use the following incantation:

R CMD BATCH nothing.R

This will execute the code in the file, and automatically direct its output into a file named nothing.Rout, which can be read with any text editor.
R may have asked you, any time you tried to quit R, whether you wanted to save your workspace image. Saving your workspace image means that R will create a special file in your current working directory (usually named .RData) containing all the objects in your current workspace, which will be automatically loaded again if you start R in that directory. This is super useful if you are working with R interactively and you want to exit R, but be able to pick up and write where you left off some other time. However, this can cause issues with reproducibility, since another useR won't have the same .RData file on their computer (and you won't have it when you rerun the same script on another computer). For this reason, we use R CMD BATCH with the --vanilla option:

R --vanilla CMD BATCH nothing.R

which means: don't restore previously saved objects from .RData, don't save the workspace image when the R script is done running, and don't read any of the files that can store custom R code that will automatically load in each R session, by default. Basically, this amounts to: don't do anything that wouldn't be able to be replicated using another computer and R installation.

The final method—which is my preference—is to use the Rscript program that comes with recent versions of R. On GNU/Linux, Macintosh, or any other Unix-like system that supports R, this will automatically be available to use from the command/terminal prompt. On Windows, the aforementioned R binary folder must be added to your PATH variable. Using Rscript is as easy as typing the following:

Rscript nothing.R

Or, if you care about reproducibility (and you do!):

Rscript --vanilla nothing.R

This is the way I suggest you run R scripts when you're not using RStudio.

Note

If you are using a Unix or Unix-like operating system (like Mac OS X or GNU/Linux), you may want to put a line like #!/usr/bin/Rscript --vanilla as the first line in your R scripts. This is called a shebang line, and will allow you to run your R scripts as a program without specifying Rscript at the prompt. For more information, read the article Shebang (Unix) on Wikipedia.
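Assuming a Unix-like system with Rscript on the PATH and a script (nothing.R from earlier) whose first line is the shebang above, the mechanics look like this; the chmod step is the only new requirement:

```shell
# make the script executable (one time only), then run it directly;
# the shebang line inside the file tells the system to use Rscript
chmod +x nothing.R
./nothing.R
```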
Anexamplescript
Here’sanexampleRscriptthatwewillbereferringtofortherestofthechapter:
#!/usr/bin/Rscript--vanilla
###########################################################
####
##nyc-sat-scores.R##
####
##Author:TonyFischetti##
##[email protected]##
####
###########################################################
##
##Aim:touseBayesiananalysistocompareNYC's2010
##combinedSATscoresagainsttheaverageofthe
##restofthecountry,which,accordingto
##FairTest.com,is1509
##
#workspacecleanup
rm(list=ls())
#options
options(echo=TRUE)
options(stringsAsFactors=FALSE)
#libraries
library(assertr)#fordatachecking
library(runjags)#forMCMC
#makesureeverythingisallsetwithJAGS
testjags()
#yep!
##readdatafile
#datawasretrievedfromNYCOpenDataportal
#directlink:https://data.cityofnewyork.us/api/views/zt9s-n5aj/rows.csv?
accessType=DOWNLOAD
nyc.sats<-read.csv("./data/SAT_Scores_NYC_2010.csv")
#let'sgivethecolumnseasiernames
better.names<-c("id","school.name","n","read.mean",
"math.mean","write.mean")
names(nyc.sats)<-better.names
#thereare460rowsbutalmost700NYCschools
#wewill*assume*,then,thatthisisarandom
#sampleofNYCschools
#let'sfirstchecktheveracityofthisdata…
#nyc.sats<-assert(nyc.sats,is.numeric,
#n,read.mean,math.mean,write.mean)
#Itlookslikecheckfailedbecausethereare"s"sforsome
#rows.(??)Alookatthedatasetdescriptionsindicates
#thatthe"s"isforschools#with5orfewerstudents.
#Forourpurposes,let'sjustexcludethem.
#Thisisafunctionthattakesavector,replacesall"s"s
#withNAsandmakecovertsallnon-"s"sintonumerics
remove.s<-function(vec){
ifelse(vec=="s",NA,vec)
}
nyc.sats$n<-as.numeric(remove.s(nyc.sats$n))
nyc.sats$read.mean<-as.numeric(remove.s(nyc.sats$read.mean))
nyc.sats$math.mean<-as.numeric(remove.s(nyc.sats$math.mean))
nyc.sats$write.mean<-as.numeric(remove.s(nyc.sats$write.mean))
#Removeschoolswithfewerthan5testtakers
nyc.sats<-nyc.sats[complete.cases(nyc.sats),]
#CalculateatotalcombinedSATscore
nyc.sats$combined.mean<-(nyc.sats$read.mean+
nyc.sats$math.mean+
nyc.sats$write.mean)
#Let'sbuildaposteriordistributionofthetruemean
#ofNYChighschools'combinedSATscores.
#We'renotgoingtolookatthesummarystatistics,because
#wedon'twanttobiasourpriors
#Specifyastandardgaussianmodel
the.model<-"
model{
#priors
mu~dunif(0,2400)
stddev~dunif(0,500)
tau<-pow(stddev,-2)
#likelihood
for(iin1:theLength){
samp[i]~dnorm(mu,tau)
}
}"
the.data<-list(
samp=nyc.sats$combined.mean,
theLength=length(nyc.sats$combined.mean)
)
results<-autorun.jags(the.model,data=the.data,
n.chains=3,
monitor=c('mu','stddev'))
#ViewtheresultsoftheMCMC
print(results)
#PlottheMCMCdiagnostics
plot(results,plot.type=c("histogram","trace"),layout=c(2,1))
#Looksgood!
#Let'sextracttheMCMCsamplesofthemeanandgetthe
#boundsofthemiddle95%
results.matrix<-as.matrix(results$mcmc)
mu.samples<-results.matrix[,'mu']
bounds<-quantile(mu.samples,c(.025,.975))
#Weare95%surethatthetruemeanisbetween1197and1232
#Nowlet'splotthemarginalposteriordistributionforthemean
#oftheNYChighschools'combinedSATgradesanddrawthe95%
#percentcredibleinterval.
plot(density(mu.samples),
main=paste("PosteriordistributionofmeancombinedSAT",
"scoreinNYChighschools(2010)",sep="\n"))
lines(c(bounds[1],bounds[2]),c(0,0),lwd=3,col="red")
#Giventheresults,theSATscoresforNYChighschoolsin2010
#are*incontrovertibly*notatparwiththeaverageSATscoresof
#thenation.
There are a few things I'd like you to note about this R script and its adherence to best practices.

First, the file name is nyc-sat-scores.R, not foo.R, doit.R, or any of that nonsense; when you are looking through your files in six months, there will be no question about what the file was supposed to do.

The second is that comments are sprinkled liberally throughout the entire script. These comments serve to state the intentions and purpose of the analysis, separate sections of code, and remind ourselves (or anyone else who is reading) where the data file came from. Additionally, comments are used to block out sections of code that we'd like to keep in the script, but which we don't want to execute. In this example, we commented out the statement that calls assert, since the assertion fails. With these comments, anybody, even an R beginner, can follow along with the code.

There are a few other manifestations of good practice on display in this script: indentation that aids in following the code flow, spaces and newlines that enhance readability, lines that are restricted to under 80 characters, and variables with informative names (no foo, bar, or baz).

Lastly, take note of the remove.s function we employ instead of copy-and-pasting ifelse(vec=="s", NA, …) four times. An angel loses its wings every time you copy-and-paste code, since it is a notorious vector for mistakes.
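To make the point concrete, here is a small hypothetical sketch of that pattern: when the same cleaning step has to hit several columns, write the helper once and map it over the columns with lapply instead of pasting the call four times. The two-row data frame below is a stand-in, not the real nyc.sats, and this variant of remove.s folds in the as.numeric conversion that the chapter's script applies separately.

```r
# Hypothetical two-row stand-in for the real nyc.sats data frame
nyc.sats <- data.frame(n          = c("10", "s"),
                       read.mean  = c("400", "s"),
                       math.mean  = c("420", "s"),
                       write.mean = c("410", "s"),
                       stringsAsFactors = FALSE)

# Variant of remove.s that also handles the numeric conversion
remove.s <- function(vec){
  as.numeric(ifelse(vec == "s", NA, vec))
}

# One application per column, without copy-and-pasting the ifelse call
score.cols <- c("n", "read.mean", "math.mean", "write.mean")
nyc.sats[score.cols] <- lapply(nyc.sats[score.cols], remove.s)
```

If a fifth column ever needs the same treatment, it is one string added to score.cols rather than a fifth pasted line.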
Scripting and reproducibility

Put any code that is not one-off, and is meant to be run again, in a script. Even for one-off code, you are better off putting it in a script, because (a) you may be wrong (and often are) about not needing to run it again, (b) it provides a record of what you've done (including, perhaps, unnoticed bugs), and (c) you may want to use similar code at another time.

Scripting enhances reproducibility because now the only things we need to reproduce this line of inquiry on another computer are the script and the data file. If we didn't place all this code in a script, we would have had to copy and paste our interactive R console history, which is ugly and messy to say the absolute least.

It's time to come clean about a fib I told in the preceding paragraph. In most cases, all you need to reproduce the results are the data file(s) and the R script(s). In some cases, however, some code you've written that works in your version of R may not work on another person's version of R. Somewhat more common is that code which uses functionality provided by a package may not work with another version of that package.
For this reason, it's good practice to record the version of R and the packages you're using. You can do this by executing sessionInfo(), copying the output, and pasting it into your R script at the bottom. Make sure to comment all of these lines out, or R will attempt to execute them the next time the script is run. For a prettier/better alternative to sessionInfo(), use the session_info() function from the devtools package. The output of devtools::session_info() for our example script looks like this:
> devtools::session_info()
Session info --------------------------------
 setting  value
 version  R version 3.2.1 (2015-06-18)
 system   x86_64, darwin13.4.0
 ui       RStudio (0.99.486)
 language (EN)
 collate  en_US.UTF-8
 tz       America/New_York
 date     1969-07-20
Packages ------------------------------------
 package  * version date       source
 assertr  * 1.0.0   2015-06-26 CRAN (R 3.2.1)
 coda       0.17-1  2015-03-03 CRAN (R 3.2.0)
 devtools   1.9.1   2015-09-11 CRAN (R 3.2.0)
 digest     0.6.8   2014-12-31 CRAN (R 3.2.0)
 lattice    0.20-33 2015-07-14 CRAN (R 3.2.0)
 memoise    0.2.1   2014-04-22 CRAN (R 3.2.0)
 modeest    2.1     2012-10-15 CRAN (R 3.2.0)
 rjags      3-15    2015-04-15 CRAN (R 3.2.0)
 runjags  * 2.0.2-8 2015-09-14 CRAN (R 3.2.0)
The packages that we explicitly loaded are marked with an asterisk; all the other packages listed are packages that are used by the packages we loaded. It is important to note the version of these packages, too, as they can potentially cause cross-version irreproducibility.
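If you would rather not copy, paste, and comment by hand, the step can be automated. This is a sketch of one way to do it (my own assumption, not the book's code), using base R's capture.output() to grab the printed session information and prefix each line with a #:

```r
# Capture the printed session information as a character vector,
# then prefix every line with "#" so R treats it as comments
info      <- capture.output(sessionInfo())
commented <- paste("#", info)
cat(commented, sep = "\n")

# Appending it to the bottom of the script itself would look like:
# cat(commented, sep = "\n", file = "nyc-sat-scores.R", append = TRUE)
```

The same trick works verbatim with devtools::session_info() in place of sessionInfo().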
R projects

There are some (rare) cases where a single R script contains the totality of your research/analyses. This may happen if you are doing simulation studies, for example. In most cases, though, an analysis will consist of a script (or scripts) and at least one data set. I refer to any R analysis that uses at least two files as an R project.

In R projects, special attention must be paid to how the files are stored relative to each other. For example, if we stored the file SAT_Scores_NYC_2010.csv on our desktop, the data import line would have read:

read.csv("/Users/bensisko/Desktop/SAT_Scores_NYC_2010.csv")

If we wanted to send this analysis to a contributor to be replicated, we would send them the script and the data file. Even if we instructed them to place the file on their desktop, the script would still not be reproducible. Our collaborators on Windows and Unix would have to manually change the argument of read.csv to C:/Users/jameskirk/Desktop/SAT_Scores_NYC_2010.csv or /home/katjaneway/Desktop/SAT_Scores_NYC_2010.csv, respectively.
A far better way to handle this situation is to organize all your files in a neat hierarchy that will allow you to specify relative paths for your data imports. In this case, it means making a folder called sat-scores (or something like that), which contains the script nyc-sat-scores.R and a folder called data that contains the file SAT_Scores_NYC_2010.csv:

Figure 13.2: A sample file/folder hierarchy for an R analysis project

The function call read.csv("./data/SAT_Scores_NYC_2010.csv") instructs R to load the dataset inside the data folder in the current working directory. Now, if we wanted to send our analysis to a collaborator, we would just send them the folder (which we can compress, if we want), and it will work no matter what our collaborator's username and operating system is. Additionally, everything is nice and neat, and in one place. Note that we put a file called README.txt into the root directory of our project. This file would contain information about the analysis, instructions for running it, and so on. This is a common convention.

Anyway, never use absolute paths!
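A slightly defensive variant of the import line (my own sketch, not from the book) uses file.path() to build the relative path portably and file.exists() to give a clear message when the script isn't being run from the project root:

```r
# file.path() joins path components with the correct separator,
# so the same line works on every operating system
data.file <- file.path("data", "SAT_Scores_NYC_2010.csv")

if (file.exists(data.file)) {
  nyc.sats <- read.csv(data.file)
} else {
  message("Not found: ", data.file,
          " -- run this script from the project's root directory")
}
```

The message makes a layout mistake obvious immediately, instead of surfacing as a cryptic read.csv connection error.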
In projects that use more than one R script, some choose a slightly different project layout. For example, let's say we divided our preceding script into load-and-clean-sat-data.R and analyze-sat-data.R; we might choose a folder hierarchy that looks like this:

Figure 13.3: A sample file/folder hierarchy for a multi-script R analysis project

Under this organizational paradigm, the two scripts are now placed in a folder called code, and a new script, master.R, is placed in the project's root directory. master.R is called a driver script, and it will call our two non-driver scripts in the right order. For example, master.R may look like this:

#!/usr/bin/Rscript --vanilla
source("./code/load-and-clean-sat-data.R")
source("./code/analyze-sat-data.R")

Now, our collaborator just has to execute master.R, which will, in turn, execute our analysis scripts.
Note

There are a few alternatives to using an R script as a driver. One common alternative is to use a shell script as a driver. These scripts contain code that is run by the operating system's command-line interpreter. A downside of this approach is that shell scripts are, in general, not portable across the Windows versus all-other-operating-systems divide.

A common, but somewhat more advanced alternative is to replace master.R with a dependency-tracking build utility like make, shake, sake, or drake. This offers a host of benefits including extensibility and identification of redundant computations.
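For a flavor of what that buys you, here is a hypothetical Makefile standing in for master.R. The two script names are the ones from the text; the intermediate and output file names are invented for illustration. Because make compares file timestamps, editing only analyze-sat-data.R re-runs only the analysis step, not the cleaning step.

```make
# Hypothetical Makefile replacing master.R
# (recipe lines must begin with a tab character)
all: output/analysis.done

data/sat-clean.csv: code/load-and-clean-sat-data.R
	Rscript code/load-and-clean-sat-data.R

output/analysis.done: data/sat-clean.csv code/analyze-sat-data.R
	Rscript code/analyze-sat-data.R
	touch output/analysis.done
```

Running make at the project root rebuilds only the targets whose dependencies have changed; that is the "identification of redundant computations" mentioned above.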
Version control

A very compelling benefit of our neat hierarchical organization scheme is that it lends itself to easy integration with version control systems. Version control systems, at a basic level, allow one to track changes/revisions to a set of files, and easily roll back to previous states of that set of files.

A simple (and inadequate) approach is to compress your analysis project at regular intervals, and post-fix the file name of each compressed copy with a timestamp. This way, if you make a mistake and would like to revert to a previous version, all you have to do is delete your current project and un-compress the project from the time you want to roll back to.

A far more sane solution is to use a remote file synchronization service that features revision tracking. The most popular of these services at the time of writing is Dropbox, though there are others such as TeamDrive and Box. These services allow you to upload your project into the cloud. When you make changes to your local copy, these services will track your changes, resynchronize the remotely stored copy, and version your project for you. Now you can revert to a previous version of just one file, instead of having to revert the entire project hierarchy.
Note

Beware! Some of these services have a limit on the number of revisions they track. Make sure you look into this for the service that you choose to use.

A great benefit of using one of these services is that any number of collaborators can be invited to work on the project simultaneously. You can even set permissions for the files each collaborator can read/write to. The service you choose should be able to track the changes made by the collaborators, too.

Perhaps the sanest solution is to use an actual version control system like Git, Mercurial, Subversion, or CVS. These are traditionally used for software projects that contain hundreds of files and many, many contributors, but they are proving to be a crackerjack solution for data analysts with just a few files and little to no other contributors. These alternatives offer the most flexibility in terms of rollback, revision tracking, conflict (incompatible changes) resolution, compression, and merging. The combination of Git and GitHub (a remote Git repository hosting service) is proving to be a particularly effective and common solution for statistical programmers.

Version control enhances reproducibility: since all the changes to the entire project (scripts/data/folder-structure layouts) are documented, all the changes are repeatable.
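As a quick, hedged illustration (the folder layout is the one from this chapter; the commit messages and identity are invented), putting a project under Git takes only a handful of commands:

```shell
# Create a toy copy of the project layout and put it under Git
mkdir -p sat-scores/data
cd sat-scores
echo "Analysis of NYC SAT scores (2010)" > README.txt

git init -q                        # start tracking this directory
git config user.name  "Analyst"    # identity recorded with each commit
git config user.email "analyst@example.com"

git add README.txt
git commit -q -m "Initial commit of SAT analysis"

git log --oneline                  # one line per recorded revision
```

From here, every meaningful change is a git add followed by a git commit, and any past state of the whole project (or a single file) can be restored.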
If your data files are small to medium, keeping them in your project will play nicely with your version control solution; it will even offer great benefits like the assurance that no one tampered with your data. If your data is too large, though, you might look into other data storage solutions like remote database storage.
Note

Package version management

Some R analysts, who rely heavily on the use of add-on CRAN packages, may choose to use a tool to manage these packages and their versions. The two most popular tools to do this are the packages packrat and checkpoint.

packrat, which is the more popular of the two, maintains a library of the packages an analysis uses inside the project's root directory. This allows the analysis and the packages it depends on to be version controlled.

checkpoint allows you to use the versions of CRAN packages as they were on a particular date. An analyst would store the date of the CRAN snapshot used at the top of a script, and the proper versions of these packages would automatically download on a collaborator's machine.
Communicating results

Unless an analysis is performed solely for the personal edification of the analyst, the results are going to be communicated, whether to teammates, your company, your lab, or the general public. Some very advanced technologies are in place for R programmers to communicate their results accurately and attractively.

Following the pattern of some of the other sections in this chapter, we will talk about a range of approaches, starting with a bad alternative and giving an explanation for why it's inadequate.

The terrible solution to creating a statistical report is to copy R output into a Word document (or PowerPoint presentation) mixed with prose. Why is this terrible, you ask? Because if one little thing about your analysis changes, you will have to re-copy the new R output into the document, manually. If you do this enough times, it's not a matter of if but a matter of when you will mess up and copy the wrong thing, or forget to copy the new output, and so on. This method just opens up too many vectors for mistakes. Additionally, any time you have to make a slight change to a plot, update a data source, alter priors, or even change the number of multiple imputation iterations to use, it requires a herculean effort on your part to keep the document up to date.
All better solutions involve having R directly output the document that you will use to communicate your results. RStudio (along with the knitr and rmarkdown packages) makes it very easy for you to have your analysis spit out a paper rendered with LaTeX, a slideshow presentation, or a self-contained HTML web page. It's even possible to have R directly output a Word document, whose contents are dynamically created using R objects.

The least attractive, but easiest, of the alternatives is to use the Compile Notebook function from the RStudio interface (the button labeled f in Figure 13.1). A pop-up should appear asking you if you want the output in HTML, PDF, or a Word document. Choose one and look at the output.

Figure 13.4: An excerpt from the output of Compile Notebook on our example script

Sure, this may not be the prettiest document in the world, but at least it combines our code (including our informative comments) and results (plots) in a single document. Further, any change to our R script followed by recompiling the notebook will result in a completely updated document for sharing. It's a little bit weird to have our narrative told completely via comments, though, right?
Literate programming is a novel programming paradigm put forth by genius computer scientist Donald Knuth (whom we mentioned in the previous chapter). This approach involves interspersing computer code and prose in the same document. Whereas the Compile Notebook feature doesn't allow for prose (except in code comments), the RStudio/knitr/rmarkdown stack allows for an approach to report generation where the prose/narrative plays a more integral part. To begin, click the New Document button (component e), and choose R Markdown… from the dropdown. Choose a title like example1 in the pop-up window, leave the default output format, and press OK. You should see a document with some unfamiliar symbols in the editor. Finally, click the button labeled Knit HTML (it's the button with the cute image of a ball of yarn), and inspect the output.

Go back to the editor and re-read the code that produced the HTML output. This is R Markdown: a lightweight markup language with easy-to-remember formatting syntax elements and support for embedded R code.
Besides the auto-generated header, the document consists of a series of two components. The first of the components is stretches of prose written in Markdown. With Markdown, a range of formatting options can be written in plain text that can be rendered in many different output formats, like HTML and PDF. These formatting options are simple: *This* produces italic text; **this** produces bold text. For a handy cheat sheet of Markdown formatting options, click the question mark icon (which appears when you are editing R Markdown [.Rmd] documents), and choose Markdown Quick Reference from the dropdown.

The second component is snippets of R code called chunks. These chunks are put between two sets of backticks (```). The set of three backticks that opens a chunk looks like ```{r}. Between the curly braces, you can optionally name the chunk, and you can specify any number of chunk options. Note that in example1.Rmd, the second chunk uses the option echo=FALSE; this means that the code snippet plot(cars) will not appear in the final rendered document, even though its output (namely, the plot) will.
There's an element of R Markdown that I want to call out explicitly: inline R code. During stretches of prose, any text between `r and ` is evaluated by the R interpreter and substituted with its result in the final rendered document. Without this mechanism, any specific numbers/information related to the data objects (like the number of observations in a dataset) would have to be hardcoded into the prose, and whenever the code changed, the onus of visiting each of these hardcoded values to make sure they are up to date would be on the report author. Using inline R to offload this updating onto R eliminates an entire class of common mistakes in report generation.

What follows is a re-working of our SAT script in R Markdown. This will give us a chance to look at this technology in more detail, and gain an appreciation for how it can help us achieve our goals of easy-to-manage, reproducible, literate research.
---
title: "NYC SAT Scores Analysis"
author: "Tony Fischetti"
date: "November 1, 2015"
output: html_document
---

#### Aim:
To use Bayesian analysis to compare NYC's 2010
combined SAT scores against the average of the
rest of the country, which, according to
FairTest.com, is 1509

```{r, echo=FALSE}
# options
options(echo=TRUE)
options(stringsAsFactors=FALSE)
```

We are going to use the `assertr` and `runjags`
packages for data checking and MCMC, respectively.
```{r}
# libraries
library(assertr)    # for data checking
library(runjags)    # for MCMC
```

Let's make sure everything is all set with JAGS!
```{r}
testjags()
```
Great!

This data was found in the NYC Open Data Portal:
https://nycopendata.socrata.com
```{r}
link.to.data <- "http://data.cityofnewyork.us/api/views/zt9s-n5aj/rows.csv?accessType=DOWNLOAD"
download.file(link.to.data, "./data/SAT_Scores_NYC_2010.csv")
nyc.sats <- read.csv("./data/SAT_Scores_NYC_2010.csv")
```

Let's give the columns easier names
```{r}
better.names <- c("id", "school.name", "n", "read.mean",
                  "math.mean", "write.mean")
names(nyc.sats) <- better.names
```

There are `r nrow(nyc.sats)` rows but almost 700 NYC schools. We will,
therefore, *assume* that this is a random sample of NYC schools.

Let's first check the veracity of this data…
```{r, error=TRUE}
nyc.sats <- assert(nyc.sats, is.numeric,
                   n, read.mean, math.mean, write.mean)
```

It looks like the check failed because there are "s"s for some rows. (??)
A look at the dataset description indicates that the "s" is for schools
with 5 or fewer students. For our purposes, let's just exclude them.

This is a function that takes a vector, replaces all "s"s
with NAs, and converts all non-"s"s into numerics
```{r}
remove.s <- function(vec){
  ifelse(vec=="s", NA, vec)
}

nyc.sats$n          <- as.numeric(remove.s(nyc.sats$n))
nyc.sats$read.mean  <- as.numeric(remove.s(nyc.sats$read.mean))
nyc.sats$math.mean  <- as.numeric(remove.s(nyc.sats$math.mean))
nyc.sats$write.mean <- as.numeric(remove.s(nyc.sats$write.mean))
```

Now we are going to remove schools with fewer than 5 test takers
and calculate a combined SAT score
```{r}
nyc.sats <- nyc.sats[complete.cases(nyc.sats), ]

# Calculate a total combined SAT score
nyc.sats$combined.mean <- (nyc.sats$read.mean +
                           nyc.sats$math.mean +
                           nyc.sats$write.mean)
```

Let's now build a posterior distribution of the true mean of NYC high
schools' combined SAT scores. We're not going to look at the summary
statistics, because we don't want to bias our priors.
We will use a standard gaussian model.
```{r, cache=TRUE, results="hide", warning=FALSE, message=FALSE}
the.model <- "
model {
  # priors
  mu ~ dunif(0, 2400)
  stddev ~ dunif(0, 500)
  tau <- pow(stddev, -2)
  # likelihood
  for(i in 1:theLength){
    samp[i] ~ dnorm(mu, tau)
  }
}"

the.data <- list(
  samp = nyc.sats$combined.mean,
  theLength = length(nyc.sats$combined.mean)
)

results <- autorun.jags(the.model, data=the.data,
                        n.chains=3,
                        monitor=c('mu'))
```

Let's view the results of the MCMC.
```{r}
print(results)
```

Now let's plot the MCMC diagnostics
```{r, message=FALSE}
plot(results, plot.type=c("histogram", "trace"), layout=c(2,1))
```
Looks good!

Let's extract the MCMC samples of the mean, and get the
bounds of the middle 95%
```{r}
results.matrix <- as.matrix(results$mcmc)
mu.samples     <- results.matrix[,'mu']
bounds         <- quantile(mu.samples, c(.025, .975))
```

We are 95% sure that the true mean is between
`r round(bounds[1], 2)` and `r round(bounds[2], 2)`.

Now let's plot the marginal posterior distribution for the mean
of the NYC high schools' combined SAT grades, and draw the 95%
percent credible interval.
```{r}
plot(density(mu.samples),
     main=paste("Posterior distribution of mean combined SAT",
                "score in NYC high schools (2010)", sep="\n"))
lines(c(bounds[1], bounds[2]), c(0,0), lwd=3, col="red")
```

Given the results, the SAT scores for NYC high schools in 2010
are **incontrovertibly** not at par with the average SAT scores of
the nation.

-----------------------------------

This is some session information for reproducibility:
```{r}
devtools::session_info()
```
This R Markdown, when rendered by knitting the HTML, looks like this:

Figure 13.5: An excerpt from the output of Knit HTML on our example R Markdown document

Now, that's a handsome document!

A few things to note: First, our contextual narrative is no longer told through code comments; the narrative, code, code output, and plots are all separate and easily distinguished. Second, note that both the number of observations in the dataset and the bounds of our credible interval are dynamically woven into the final document. If we change our priors, or use a different likelihood function (and we should; see exercise #3), the bounds as they appear in our final report will be automatically updated.

Finally, take a look at the chunk options we've used. We hid the code in our first chunk so that we didn't clutter the final document with option setting. In the sixth chunk, we used the option error=TRUE to let the renderer know that we expected the contained code to fail. The printed error message nicely illustrates why we had to spend the subsequent chunk on data cleaning. In the ninth chunk (the one where we run the MCMC chains), we use quite a few options. cache=TRUE caches the result of the chunk so that if the chunk's code doesn't change, we don't have to wait for the MCMC chains to converge every time we render the document. We use results="hide" to hide the verbose output of autorun.jags. We use warning=FALSE to suppress the warning emitted by autorun.jags informing us that we didn't choose starting values for the chains. Lastly, we use message=FALSE to quiet the message produced by autorun.jags that the rjags namespace is automatically being loaded. autorun.jags sure is chatty!
We may opt to use different chunk options depending on our intended audience. For example, we could hide more of the code, and focus more on the output and interpretation, if we were communicating the results to a party of non-statistical programmers. On the other hand, we would hide less of the code if we were using the rendered HTML as a pedagogical document to teach budding R programmers how to use R Markdown.

The HTML that is produced can now be uploaded, as a standalone document, to a web server so that the results can be sent to others as a hyperlink. Bear in mind, too, that we are not limited to knitting HTML; we could have just as easily knitted a PDF or Word document. We could have also used R Markdown to produce a slideshow presentation; I use this technology all the time at work.

You don't necessarily have to use RStudio to produce these handsome, dynamically generated reports (they can be rendered using only the knitr and rmarkdown packages and a format conversion utility called pandoc), but RStudio makes writing them so easy, you would need a really compelling reason to use any other editor.
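Outside of RStudio, the rendering is one function call; this sketch assumes the example1.Rmd file from earlier and a pandoc installation on the PATH:

```r
# rmarkdown::render() drives knitr (to run the chunks) and then
# pandoc (to convert the result) in one step
library(rmarkdown)
render("example1.Rmd", output_format = "html_document")
```

This is what RStudio's Knit HTML button calls under the hood, which is also why the rendering can be automated from a driver script or a Makefile.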
knitr is a beefy package indeed, and we only touched the tip of the iceberg with regard to what it is capable of; we didn't cover, for example, customizing the reports with HTML, embedding math equations into the reports, or using LaTeX (instead of R Markdown) for increased flexibility. If you see the power in knitr, and dynamically generated literate documents in general, I urge you to learn more about it.
Exercises

Practice the following exercises to revise the concepts learned in this chapter:

1. Review: When we created the data frame nothing, we combined a vector of 1,000 binomially distributed random variables, 1,000 normally distributed random variables, and a vector of two colors, red and white. Since all the columns in a data frame have to be the same length, how did R allow this? What is the property of vectors that allows this?

2. Seek out, read, and attempt to understand the source code of some of your favorite R packages. What version control system is the author of the package using?

3. Carefully review the analysis that was used as an example in this chapter. In what manner can this analysis be improved upon? Look at the distribution of the combined SAT scores in NYC schools. Why was modeling the SAT scores with a Gaussian likelihood function a (very) bad choice? What could we have done instead?

4. If both a poor and a rich person are willing to buy a pair of sneakers for no more than $40, who values the sneakers the most, and who should get the sneakers in order for that resource to be allocated most efficiently? Couch your answer in terms of the diminishing marginal utility of money. What would the law of diminishing marginal utility say about the most equitable income tax schema, with respect to different income levels?
Summary

This last chapter, which was uncharacteristically light on theory, may be one of the most important chapters in the whole book. In order to be a productive data analyst using R, you simply must be acquainted with the tools and workflows of professional R programmers.

The first topic we touched on was the link between best practices and reproducibility, and why reproducibility is an integral part of a productive and sane analyst's workflow. Next, we discussed the basics of R scripting, and how to run completed scripts all at once. We saw how RStudio, R's best IDE, can help us while we write these scripts by providing a mechanism to execute code, line by line, as we write it. To really cement your understanding of R scripting, we saw an example R script that illustrated clean design and adherence to best practices (informative variable names, readable layout, myriad informative comments, and so on).

Then, you learned of a few ways that you can organize multi-file analysis projects. You saw how the correct organizational structure of analysis projects naturally lends itself to integration with version control, a powerful tool in the organized analyst's utility belt. You learned how the benefits conferred by a sophisticated version control system (the ability to revert to previous versions, track all revisions, and merge incompatible revisions) could potentially save an analyst from hours of heartache.

Finally, you saw how to use the RStudio/knitr/rmarkdown stack to help you achieve the goal of producing a reproducible report of your analyses. You learned the dangers of ad hoc/copy-and-paste manual report generation, and discovered that a better solution is to charge R with creating the report itself. The simplest solution, compiling a notebook, was at least better than the manual alternatives, but produced reports that were somewhat lacking in the flexibility and aesthetics departments. You saw that, instead, we can use R Markdown to create fancy-pants, attractive, dynamically generated reports that cut down on errors, complement reproducibility, and aid in the effective dissemination of information.
Index

A
alpha level (α level) / Null Hypothesis Significance Testing
Analysis of Covariance (ANCOVA) / Logistic regression
Analysis of Variance (ANOVA)
  about / Testing more than two means
  assumptions / Assumptions of ANOVA
anonymous functions / Functions
arguments / Arithmetic and assignment
arithmetic operators / Arithmetic and assignment
assignment operators / Arithmetic and assignment

B
bagged trees technique / Random forests
bagging / Random forests
bandwidth / Probability distributions
base R / Visualization methods
batch mode / Navigating the basics
Bayes factors / The Bayesian independent samples t-test
Bayesian independent samples t-test
  performing / The Bayesian independent samples t-test
Bayesian linear regression / Advanced topics
bell curve / Central tendency
beta level (β level) / When things go wrong
bias-variance trade-off
  about / The bias-variance trade-off
  cross-validation / Cross-validation
  balance, striking / Striking a balance
binomial distribution / Null Hypothesis Significance Testing
bivariate relationship (two variable) / Multivariate data
Bonferroni correction / Testing more than two means
bootstrap aggregating / Random forests
box-and-whisker plot / Relationships between a categorical and a continuous variable
C
categorical variable
  and continuous variable, relationship between / Relationships between a categorical and a continuous variable
  relationships, describing / Relationships between two categorical variables
  visualization methods / Categorical and continuous variables
central tendency
  measuring / Central tendency
character data type / Logicals and characters
chi-square distribution / Testing independence of proportions
chi-squared statistic / Testing independence of proportions
circular decision boundary / The circular decision boundary
classifier
  selecting / Choosing a classifier
  vertical decision boundary / The vertical decision boundary
  diagonal decision boundary / The diagonal decision boundary
  crescent decision boundary / The crescent decision boundary
  circular decision boundary / The circular decision boundary
Cohen’s d / Don’t be fooled!
comments / Arithmetic and assignment
Comprehensive R Archive Network (CRAN) / Working with packages
confidence intervals
  using / Interval estimation
  about / How did we get 1.96?
confusion matrix / Confusion matrices
continuous variable
  and categorical variable, relationship between / Relationships between a categorical and a continuous variable
  relationships, describing / The relationship between two continuous variables
  covariance / Covariance
  correlation coefficients / Correlation coefficients
  multiple correlations, comparing / Comparing multiple correlations
continuous variables
  visualization methods / Categorical and continuous variables
controlled experiment / Testing two means
correlation coefficients / Correlation coefficients
cost complexity pruning / Decision trees
covariance / Covariance
covariance matrix / Comparing multiple correlations
crescent decision boundary / The crescent decision boundary
cross-tabulation / Relationships between two categorical variables
crosstab / Relationships between two categorical variables
D
data
  loading, in R / Loading data into R, Working with packages
data formats
  about / Other data formats
decision trees
  about / Decision trees
degrees of freedom / Populations, samples, and estimation
diagonal decision boundary / The diagonal decision boundary
directional hypothesis / One and two-tailed tests
discrete numeric variable / Univariate data

E
Emacs Speaks Statistics (ESS) / R Scripting
ensemble learning / Random forests
estimation / Populations, samples, and estimation

F
flow of control constructs / Flow of control
frequency distributions
  about / Frequency distributions
  examples / Frequency distributions
functions / Functions

G
Gaussian distribution / Central tendency
Generalized Linear Model (GLM) / Logistic regression
ggplot2
  about / Visualization methods
  using / Visualization methods

H
hash-tag / Arithmetic and assignment
help.start() function / Getting help in R
Holm-Bonferroni correction / Testing more than two means
I
ifelse() function / Advanced subsetting
imputation
  methods / Methods of imputation
  multiple imputation / Multiple imputation in practice
independence of proportions
  statistical significance / Don’t be fooled!
  testing / Testing independence of proportions
independent / Testing two means
independent samples t-test
  using / Testing two means
  assumptions / Assumptions of the independent samples t-test
indexing / Subsetting
Integrated Development Environment (IDE) / R Scripting
interaction terms / Advanced topics
interquartile range
  using / Spread
interval estimation
  about / Interval estimation
  qnorm function, using / How did we get 1.96?
inverse link function / Logistic regression
Iteratively Re-Weighted Least Squares (IWLS) / A word of warning

J
JavaScript Object Notation (JSON)
  about / Using JSON
Justified True Belief (JTB) / Exercises

K
k-NN
  using, in R / Using k-NN in R
  limitations / Limitations of k-NN
k-NN, using in R
  about / Using k-NN in R
  confusion matrices / Confusion matrices
kernel density estimation / Probability distributions
kitchen sink regression
  about / Kitchen sink regression
Kruskal-Wallis test / What if my assumptions are unfounded?
L
lambda functions / Functions
Last.fm developer
  URL / Using JSON
left-tailed / Central tendency
linear models
  about / Linear models
linear regression, diagnostics
  about / Linear regression diagnostics
  Anscombe relationship, second / Second Anscombe relationship
  Anscombe relationship, third / Third Anscombe relationship
  Anscombe relationship, fourth / Fourth Anscombe relationship
link function / Logistic regression
logical data type / Logicals and characters
logistic function / Logistic regression
logistic regression
  about / Logistic regression
  using / Logistic regression
  using, in R / Using logistic regression in R
  limitations / Using logistic regression in R
logit function / Logistic regression
M
machine / Using a bigger and faster machine
Mann-Whitney U test / What if my assumptions are unfounded?
matrix
  creating / Matrices
  about / Matrices
Maximum Likelihood Estimation (MLE) / Logistic regression
mean height
  estimating / Estimating means
Mean Squared Error (MSE) / Simple linear regression
measures of spread
  for categorical data / Spread
missing data
  analysis / Analysis with missing data
  visualizing / Visualizing missing data
  types / Types of missing data, So which one is it?
  methods, for dealing / Unsophisticated methods for dealing with missing data
  complete case analysis / Complete case analysis
  pairwise distribution / Pairwise deletion
  mean substitution / Mean substitution
  hot deck imputation / Hot deck imputation
  regression imputation / Regression imputation
  stochastic regression imputation / Stochastic regression imputation
  multiple imputation / Multiple imputation, So how does mice come up with the imputed values?
  out-of-bounds data, checking for / Checking for out-of-bounds data
  column data type, checking / Checking the data type of a column
  unexpected categories, checking / Checking for unexpected categories
  outliers, checking for / Checking for outliers, entry errors, or unlikely data points
  entry errors, checking / Checking for outliers, entry errors, or unlikely data points
  unlikely data points, checking / Checking for outliers, entry errors, or unlikely data points
  outliers, checking / Checking for outliers, entry errors, or unlikely data points
  assertions, chaining / Chaining assertions
multiple correlations
  comparing / Comparing multiple correlations
multiple means
  testing / Testing more than two means
multiple regression
  about / Multiple regression
multivariate data
  about / Multivariate data
MusicBrainz
  URL / XML
N
negatively skewed / Central tendency
NHST
  about / Null Hypothesis Significance Testing
  default hypothesis / Null Hypothesis Significance Testing
  one-tailed test / One and two-tailed tests
  two-tailed tests / One and two-tailed tests
  Type I error / When things go wrong
  Type II error / When things go wrong
  significance, warning / A warning about significance
  p-values, warning / A warning about p-values
non-binary predictor
  regression with / Regression with a non-binary predictor
non-linear modeling / Advanced topics
normal distribution
  about / The normal distribution
  three-sigma rule / The three-sigma rule and using z-tables
  z-tables, using / The three-sigma rule and using z-tables
  fitting, to precipitation dataset / Fitting distributions the Bayesian way
Not a Number (NaN) / Arithmetic and assignment
not available (NA) / Subsetting
null hypothesis terminology / Null Hypothesis Significance Testing

O
one-tailed hypothesis test
  running / Testing the mean of one sample
one-tailed test / One and two-tailed tests
one sample t-test
  about / Testing the mean of one sample
  assumptions / Assumptions of the one sample t-test
online repositories
  about / Online repositories
OpenRefine / OpenRefine
optimized packages
  using / Using optimized packages
optimizing
  ways / Wait to optimize
Out-Of-Bag (OOB) / Random forests
P
p-value
  about / Null Hypothesis Significance Testing
  warning / A warning about p-values
pairwise t-tests / Testing more than two means
parallelization
  using / Use parallelization
parallel R / Getting started with parallel R, An example of (some) substance
parametric statistical tests / What if my assumptions are unfounded?
Pearson’s correlation / Correlation coefficients
polynomial regression / The circular decision boundary
population / Populations, samples, and estimation
positively skewed / Central tendency
power / When things go wrong
predict function / Random forests
probability density function (PDF) / Probability distributions
probability distributions
  about / Probability distributions
  bandwidth, selecting / Probability distributions
probability mass function (PMF) / Probability distributions
pruning / Decision trees

Q
qnorm function
  using / How did we get 1.96?
qplot (qplot) / Visualization methods
quantile / How did we get 1.96?
quantile-quantile plot (QQ-plot)
  using / What if my assumptions are unfounded?
R
R
  about / Navigating the basics
  arithmetic operators / Arithmetic and assignment
  assignment operators / Arithmetic and assignment
  logical data type / Logicals and characters
  character data type / Logicals and characters
  flow of control constructs / Flow of control
  help, obtaining / Getting help in R
  data, loading / Loading data into R, Working with packages
  k-NN, using / Using k-NN in R
  logistic regression / Using logistic regression in R
random forests
  about / Random forests
rank
  assigning / Correlation coefficients
R code
  about / Be smart about your code, Be smarter about your code, Exercises
  memory, allocation / Allocation of memory
  vectorization / Vectorization
Rcpp
  using / Using Rcpp
Read-Evaluate-Print-Loop (REPL) / Navigating the basics
recursive splitting / Decision trees
regression / Correlation coefficients
regular expressions / Regular expressions
regularization / Advanced topics
relational database
  about / Relational Databases
Residual Sum of Squares (RSS) / Simple linear regression
results
  communicating / Communicating results
right-tailed / Central tendency
R implementation
  using / Using another R implementation
rnorm function / Estimating means
Root Mean Squared Error (RMSE) / Simple linear regression
R projects
  about / R projects
R Scripting
  about / R Scripting
  RStudio / RStudio
  running / Running R scripts
R scripts
  running / Running R scripts
  example / An example script
RStudio
  about / RStudio
S
samples / Populations, samples, and estimation
sampling distribution / The sampling distribution
scatterplot / The relationship between two continuous variables
scripting
and reproducibility / Scripting and reproducibility
Shapiro-Wilk test / What if my assumptions are unfounded?
simple linear regression
about / Simple linear regression
with binary predictor / Simple linear regression with a binary predictor
warning / A word of warning
Simpson's Paradox / Relationships between two categorical variables
skewness degree / Central tendency
smaller samples / Smaller samples
Spearman's rank coefficient (rho) / Correlation coefficients
split point / Decision trees
spread
measuring / Spread
SQL query / Why didn't we just do that in SQL?
standard deviation / Spread
standard error / The sampling distribution
subsetting / Subsetting
T
t-distribution (Student's t-distribution)
about / Smaller samples
t-test / Testing the mean of one sample
test statistic
defining / Null Hypothesis Significance Testing
three-sigma rule / The three-sigma rule and using z-tables
tidyr / tidyr
trend line / Correlation coefficients
Tukey's variation / Relationships between a categorical and a continuous variable
tuning parameters / Decision trees
Type I errors / When things go wrong
Type II errors / When things go wrong
U
univariate data
about / Univariate data
unsanitized data
analysis / Analysis with unsanitized data
V
vectorized functions / Vectorized functions
vectors
about / Vectors
building / Vectors
subsetting / Subsetting
vectorized functions / Vectorized functions
advanced subsetting / Advanced subsetting
recycling / Recycling
version control
about / Version control
vertical decision boundary / The vertical decision boundary
visualization methods
about / Visualization methods
of categorical data / Categorical and continuous variables
of continuous variables / Categorical and continuous variables
of two categorical variables / Two categorical variables
of two continuous variables / Two continuous variables
of multiple continuous variables / More than two continuous variables
Visualizing Categorical Data (vcd) / Two categorical variables
W
Wilcoxon rank-sum test / What if my assumptions are unfounded?
X
xbar / Central tendency
XML
about / XML
XPath
URL / XML
Z
z-scores / Correlation coefficients
z-tables
using / The three-sigma rule and using z-tables