Data Analysis with R

Table of Contents

Data Analysis with R
Credits
About the Author
About the Reviewer
www.PacktPub.com
  Support files, eBooks, discount offers, and more
    Why subscribe?
    Free access for Packt account holders
Preface
  What this book covers
  What you need for this book
  Who this book is for
  Conventions
  Reader feedback
  Customer support
    Downloading the example code
    Downloading the color images of this book
    Errata
    Piracy
    Questions
1. RefresheR
  Navigating the basics
    Arithmetic and assignment
    Logicals and characters
    Flow of control
  Getting help in R
  Vectors
    Subsetting
    Vectorized functions
    Advanced subsetting
    Recycling
  Functions
  Matrices
  Loading data into R
  Working with packages
  Exercises
  Summary
2. The Shape of Data
  Univariate data
  Frequency distributions
  Central tendency
  Spread
  Populations, samples, and estimation
  Probability distributions
  Visualization methods
  Exercises
  Summary
3. Describing Relationships
  Multivariate data
  Relationships between a categorical and a continuous variable
  Relationships between two categorical variables
  The relationship between two continuous variables
    Covariance
    Correlation coefficients
    Comparing multiple correlations
  Visualization methods
    Categorical and continuous variables
    Two categorical variables
    Two continuous variables
    More than two continuous variables
  Exercises
  Summary
4. Probability
  Basic probability
  A tale of two interpretations
  Sampling from distributions
    Parameters
    The binomial distribution
    The normal distribution
      The three-sigma rule and using z-tables
  Exercises
  Summary
5. Using Data to Reason About the World
  Estimating means
  The sampling distribution
  Interval estimation
    How did we get 1.96?
  Smaller samples
  Exercises
  Summary
6. Testing Hypotheses
  Null Hypothesis Significance Testing
    One and two-tailed tests
    When things go wrong
    A warning about significance
    A warning about p-values
  Testing the mean of one sample
    Assumptions of the one sample t-test
  Testing two means
    Don't be fooled!
    Assumptions of the independent samples t-test
  Testing more than two means
    Assumptions of ANOVA
  Testing independence of proportions
  What if my assumptions are unfounded?
  Exercises
  Summary
7. Bayesian Methods
  The big idea behind Bayesian analysis
  Choosing a prior
  Who cares about coin flips
  Enter MCMC – stage left
  Using JAGS and runjags
  Fitting distributions the Bayesian way
  The Bayesian independent samples t-test
  Exercises
  Summary
8. Predicting Continuous Variables
  Linear models
  Simple linear regression
    Simple linear regression with a binary predictor
  A word of warning
  Multiple regression
    Regression with a non-binary predictor
  Kitchen sink regression
  The bias-variance trade-off
    Cross-validation
    Striking a balance
  Linear regression diagnostics
    Second Anscombe relationship
    Third Anscombe relationship
    Fourth Anscombe relationship
  Advanced topics
  Exercises
  Summary
9. Predicting Categorical Variables
  k-Nearest Neighbors
    Using k-NN in R
    Confusion matrices
    Limitations of k-NN
  Logistic regression
    Using logistic regression in R
  Decision trees
  Random forests
  Choosing a classifier
    The vertical decision boundary
    The diagonal decision boundary
    The crescent decision boundary
    The circular decision boundary
  Exercises
  Summary
10. Sources of Data
  Relational Databases
    Why didn't we just do that in SQL?
  Using JSON
  XML
  Other data formats
  Online repositories
  Exercises
  Summary
11. Dealing with Messy Data
  Analysis with missing data
    Visualizing missing data
    Types of missing data
    So which one is it?
    Unsophisticated methods for dealing with missing data
      Complete case analysis
      Pairwise deletion
      Mean substitution
      Hot deck imputation
      Regression imputation
      Stochastic regression imputation
    Multiple imputation
      So how does mice come up with the imputed values?
      Methods of imputation
      Multiple imputation in practice
  Analysis with unsanitized data
    Checking for out-of-bounds data
    Checking the data type of a column
    Checking for unexpected categories
    Checking for outliers, entry errors, or unlikely data points
    Chaining assertions
  Other messiness
    OpenRefine
    Regular expressions
    tidyr
  Exercises
  Summary
12. Dealing with Large Data
  Wait to optimize
  Using a bigger and faster machine
  Be smart about your code
    Allocation of memory
    Vectorization
    Using optimized packages
    Using another R implementation
  Use parallelization
    Getting started with parallel R
    An example of (some) substance
    Using Rcpp
  Be smarter about your code
  Exercises
  Summary
13. Reproducibility and Best Practices
  R Scripting
    RStudio
    Running R scripts
    An example script
    Scripting and reproducibility
  R projects
  Version control
  Communicating results
  Exercises
  Summary
Index

Data Analysis with R

Data Analysis with R

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: December 2015

Production reference: 1171215

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78528-814-2

www.packtpub.com

Credits

Author
Tony Fischetti

Reviewer
Dipanjan Sarkar

Commissioning Editor
Akram Hussain

Acquisition Editor
Meeta Rajani

Content Development Editor
Anish Dhurat

Technical Editor
Siddhesh Patil

Copy Editor
Sonia Mathur

Project Coordinator
Bijal Patel

Proofreader
Safis Editing

Indexer
Monica Ajmera Mehta

Graphics
Disha Haria

Production Coordinator
Conidon Miranda

Cover Work
Conidon Miranda

About the Author

Tony Fischetti is a data scientist at College Factual, where he gets to use R every day to build personalized rankings and recommender systems. He graduated in cognitive science from Rensselaer Polytechnic Institute, and his thesis was strongly focused on using statistics to study visual short-term memory.

Tony enjoys writing and contributing to open source software, blogging at http://www.onthelambda.com, writing about himself in third person, and sharing his knowledge using simple, approachable language and engaging examples.

The more traditionally exciting of his daily activities include listening to records, playing the guitar and bass (poorly), weight training, and helping others.

Because I'm aware of how incredibly lucky I am, it's really hard to express all the gratitude I have for everyone in my life that helped me—either directly, or indirectly—in completing this book. The following (partial) list is my best attempt at balancing thoroughness whilst also maximizing the number of people who will read this section by keeping it to a manageable length.

First, I'd like to thank all of my educators. In particular, I'd like to thank the Bronx High School of Science and Rensselaer Polytechnic Institute. More specifically, I'd like to thank the Bronx Science Robotics Team, all its members, its team moms, the wonderful Dena Ford and Cherrie Fleisher-Strauss; and Justin Fox. From the latter institution, I'd like to thank all of my professors and advisors. Shout out to Mike Kalsher, Michael Schoelles, Wayne Gray, Bram van Heuveln, Larry Reid, and Keith Anderson (especially Keith Anderson).
I'd like to thank the New York Public Library, Wikipedia, and other freely available educational resources. On a related note, I need to thank the R community and, more generally, all of the authors of R packages and other open source software I use, for spending their own personal time to benefit humanity. Shout out to GNU, the R core team, and Hadley Wickham (who wrote a majority of the R packages I use daily).

Next, I'd like to thank the company I work for, College Factual, and all of my brilliant co-workers from whom I've learned so much.

I also need to thank my support network of millions, and my many many friends that have all helped me more than they will likely ever realize.

I'd like to thank my partner, Bethany Wickham, who has been absolutely instrumental in providing much needed and appreciated emotional support during the writing of this book, and putting up with the mood swings that come along with working all day and writing all night.

Next, I'd like to express my gratitude for my sister, Andrea Fischetti, who means the world to me. Throughout my life, she's kept me warm and human in spite of the scientist in me that likes to get all reductionist and cerebral.

Finally, and most importantly, I'd like to thank my parents. This book is for my father, to whom I owe my love of learning and my interest in science and statistics; and for my mother, for her love and unwavering support, and to whom I owe my work ethic and ability to handle anything and tackle any challenge.

About the Reviewer

Dipanjan Sarkar is an IT engineer at Intel, the world's largest silicon company, where he works on analytics, business intelligence, and application development. He received his master's degree in information technology from the International Institute of Information Technology, Bangalore. Dipanjan's areas of specialization include software engineering, data science, machine learning, and text analytics.

His interests include learning about new technologies, disruptive start-ups, and data science. In his spare time, he loves reading, playing games, and watching popular sitcoms.

Dipanjan also reviewed Learning R for Geospatial Analysis and R Data Analysis Cookbook, both by Packt Publishing.
I would like to thank Bijal Patel, the project coordinator of this book, for making the reviewing experience really interactive and enjoyable.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

Preface

I'm going to shoot it to you straight: there are a lot of books about data analysis and the R programming language. I'll take it on faith that you already know why it's extremely helpful and fruitful to learn R and data analysis (if not, why are you reading this preface?!) but allow me to make a case for choosing this book to guide you in your journey.

For one, this subject didn't come naturally to me. There are those with an innate talent for grasping the intricacies of statistics the first time it is taught to them; I don't think I'm one of these people. I kept at it because I love science and research and knew that data analysis was necessary, not because it immediately made sense to me. Today, I love the subject in and of itself, rather than instrumentally, but this only came after months of heartache.
Eventually, as I consumed resource after resource, the pieces of the puzzle started to come together. After this, I started tutoring all of my friends in the subject—and have seen them trip over the same obstacles that I had to learn to climb. I think that coming from this background gives me a unique perspective on the plight of the statistics student and allows me to reach them in a way that others may not be able to. By the way, don't let the fact that statistics used to baffle me scare you; I have it on fairly good authority that I know what I'm talking about today.

Secondly, this book was born of the frustration that most statistics texts tend to be written in the driest manner possible. In contrast, I adopt a light-hearted buoyant approach—but without becoming agonizingly flippant.

Third, this book includes a lot of material that I wished were covered in more of the resources I used when I was learning about data analysis in R. For example, the entire last unit specifically covers topics that present enormous challenges to R analysts when they first go out to apply their knowledge to imperfect real-world data.

Lastly, I thought long and hard about how to lay out this book and which order of topics was optimal. And when I say long and hard I mean I wrote a library and designed algorithms to do this. The order in which I present the topics in this book was very carefully considered to (a) build on top of each other, (b) follow a reasonable level of difficulty progression allowing for periodic chapters of relatively simpler material (psychologists call this intermittent reinforcement), (c) group highly related topics together, and (d) minimize the number of topics that require knowledge of yet unlearned topics (this is, unfortunately, common in statistics). If you're interested, I detail this procedure in a blog post that you can read at http://bit.ly/teach-stats.

The point is that the book you're holding is a very special one—one that I poured my soul into. Nevertheless, data analysis can be a notoriously difficult subject, and there may be times where nothing seems to make sense. During these times, remember that many others (including myself) have felt stuck, too. Persevere… the reward is great. And remember, if a blockhead like me can do it, you can, too. Go you!
What this book covers

Chapter 1, RefresheR, reviews the aspects of R that subsequent chapters will assume knowledge of. Here, we learn the basics of R syntax, learn R's major data structures, write functions, load data, and install packages.

Chapter 2, The Shape of Data, discusses univariate data. We learn about different data types, how to describe univariate data, and how to visualize the shape of these data.

Chapter 3, Describing Relationships, goes on to the subject of multivariate data. In particular, we learn about the three main classes of bivariate relationships and learn how to describe them.

Chapter 4, Probability, kicks off a new unit by laying the foundation. We learn about basic probability theory, Bayes' theorem, and probability distributions.

Chapter 5, Using Data to Reason About the World, discusses sampling and estimation theory. Through examples, we learn of the central limit theorem, point estimation, and confidence intervals.

Chapter 6, Testing Hypotheses, introduces the subject of Null Hypothesis Significance Testing (NHST). We learn many popular hypothesis tests and their non-parametric alternatives. Most importantly, we gain a thorough understanding of the misconceptions and gotchas of NHST.

Chapter 7, Bayesian Methods, introduces an alternative to NHST based on a more intuitive view of probability. We learn the advantages and drawbacks of this approach, too.

Chapter 8, Predicting Continuous Variables, thoroughly discusses linear regression. Before the chapter's conclusion, we learn all about the technique, when to use it, and what traps to look out for.

Chapter 9, Predicting Categorical Variables, introduces four of the most popular classification techniques. By using all four on the same examples, we gain an appreciation for what makes each technique shine.

Chapter 10, Sources of Data, is all about how to use different data sources in R. In particular, we learn how to interface with databases, and request and load JSON and XML via an engaging example.

Chapter 11, Dealing with Messy Data, introduces some of the snags of working with less than perfect data in practice. The bulk of this chapter is dedicated to missing data, imputation, and identifying and testing for messy data.
Chapter 12, Dealing with Large Data, discusses some of the techniques that can be used to cope with data sets that are larger than can be handled swiftly without a little planning. The key components of this chapter are on parallelization and Rcpp.

Chapter 13, Reproducibility and Best Practices, closes with the extremely important (but often ignored) topic of how to use R like a professional. This includes learning about tooling, organization, and reproducibility.

What you need for this book

All code in this book has been written against the latest version of R—3.2.2 at the time of writing. As a matter of good practice, you should keep your R version up to date but most, if not all, code should work with any reasonably recent version of R. Some of the R packages we will be installing will require more recent versions, though. For the other software that this book uses, instructions will be furnished pro re nata. If you want to get a head start, however, install RStudio, JAGS, and a C++ compiler (or Rtools if you use Windows).

Who this book is for

Whether you are learning data analysis for the first time, or you want to deepen the understanding you already have, this book will prove to be an invaluable resource. If you are looking for a book to bring you all the way through the fundamentals to the application of advanced and effective analytics methodologies, and have some prior programming experience and a mathematical background, then this is for you.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We will use the system.time function to time the execution."

A block of code is set as follows:

library(VIM)
aggr(miss_mtcars, numbers=TRUE)

Any command-line input or output is written as follows:

# R --vanilla CMD BATCH nothing.R

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Clicking the Next button moves you to the next screen."

Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/Data_Analysis_With_R_ColorImages.pd

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. RefresheR

Before we dive into the (other) fun stuff (sampling multi-dimensional probability distributions, using convex optimization to fit data models, and so on), it would be helpful if we review those aspects of R that all subsequent chapters will assume knowledge of.

If you fancy yourself as an R guru, you should still, at least, skim through this chapter, because you'll almost certainly find the idioms, packages, and style introduced here to be beneficial in following along with the rest of the material.

If you don't care much about R (yet), and are just in this for the statistics, you can heave a heavy sigh of relief that, for the most part, you can run the code given in this book in the interactive R interpreter with very little modification, and just follow along with the ideas. However, it is my belief (read: delusion) that by the end of this book, you'll cultivate a newfound appreciation of R alongside a robust understanding of methods in data analysis.

Fire up your R interpreter, and let's get started!
Navigating the basics

In the interactive R interpreter, any line starting with a > character denotes R asking for input. (If you see a + prompt, it means that you didn't finish typing a statement at the prompt and R is asking you to provide the rest of the expression.) Striking the return key will send your input to R to be evaluated. R's response is then spit back at you in the line immediately following your input, after which R asks for more input. This is called a REPL (Read-Evaluate-Print-Loop). It is also possible for R to read a batch of commands saved in a file (unsurprisingly called batch mode), but we'll be using the interactive mode for most of the book.

As you might imagine, R supports all the familiar mathematical operators found in most other languages:

Arithmetic and assignment

Check out the following example:

> 2 + 2
[1] 4
> 9 / 3
[1] 3
> 5 %% 2      # modulus operator (remainder of 5 divided by 2)
[1] 1

Anything that occurs after the octothorpe or pound sign, # (or hash-tag for you young'uns), is ignored by the R interpreter. This is useful for documenting the code in natural language. These are called comments.

In a multi-operation arithmetic expression, R will follow the standard order of operations from math. In order to override this natural order, you have to use parentheses flanking the sub-expression that you'd like to be performed first.

> 3 + 2 - 10 ^ 2      # ^ is the exponent operator
[1] -95
> 3 + (2 - 10) ^ 2
[1] 67

In practice, almost all compound expressions are split up with intermediate values assigned to variables which, when used in future expressions, are just like substituting the variable with the value that was assigned to it. The (primary) assignment operator is <-.
> # assignments follow the form VARIABLE <- VALUE
> var <- 10
> var
[1] 10
> var ^ 2
[1] 100
> VAR / 2      # variable names are case-sensitive
Error: object 'VAR' not found

Notice that the first and second lines in the preceding code snippet didn't have an output to be displayed, so R just immediately asked for more input. This is because assignments don't have a return value. Their only job is to give a value to a variable, or to change the existing value of a variable. Generally, operations and functions on variables in R don't change the value of the variable. Instead, they return the result of the operation. If you want to change a variable to the result of an operation using that variable, you have to reassign that variable as follows:

> var          # var is 10
[1] 10
> var ^ 2
[1] 100
> var          # var is still 10
[1] 10
> var <- var ^ 2      # no return value
> var          # var is now 100
[1] 100

Be aware that variable names may contain numbers, underscores, and periods; this is something that trips up a lot of people who are familiar with other programming languages that disallow using periods in variable names. The only further restrictions on variable names are that they must start with a letter (or a period and then a letter), and that they must not be one of the reserved words in R such as TRUE, Inf, and so on.

Although the arithmetic operators that we've seen thus far are functions in their own right, most functions in R take the form: function_name(value(s) supplied to the function). The values supplied to the function are called arguments of that function.

> cos(3.14159)      # cosine function
[1] -1
> cos(pi)           # pi is a constant that R provides
[1] -1
> acos(-1)          # arccosine function
[1] 3.141593
> acos(cos(pi)) + 10
[1] 13.14159
> # functions can be used as arguments to other functions

(If you paid attention in math class, you'll know that the cosine of π is -1, and that arccosine is the inverse function of cosine.)

There are hundreds of such useful functions defined in base R, only a handful of which we will see in this book. Two sections from now, we will be building our very own functions.
Before we move on from arithmetic, it will serve us well to visit some of the odd values that may result from certain operations:

> 1 / 0
[1] Inf
> 0 / 0
[1] NaN

It is common during practical usage of R to accidentally divide by zero. As you can see, this undefined operation yields an infinite value in R. Dividing zero by zero yields the value NaN, which stands for Not a Number.

Logicals and characters

So far, we've only been dealing with numerics, but there are other atomic data types in R. To wit:

> foo <- TRUE      # foo is of the logical data type
> class(foo)       # class() tells us the type
[1] "logical"
> bar <- "hi!"     # bar is of the character data type
> class(bar)
[1] "character"

The logical data type (also called Booleans) can hold the values TRUE or FALSE or, equivalently, T or F. The familiar operators from Boolean algebra are defined for these types:

> foo
[1] TRUE
> foo && TRUE      # boolean and
[1] TRUE
> foo && FALSE
[1] FALSE
> foo || FALSE     # boolean or
[1] TRUE
> !foo             # negation operator
[1] FALSE

In a Boolean expression with a logical value and a number, any number that is not 0 is interpreted as TRUE.

> foo && 1
[1] TRUE
> foo && 2
[1] TRUE
> foo && 0
[1] FALSE

Additionally, there are functions and operators that return logical values such as:

> 4 < 2       # less than operator
[1] FALSE
> 4 >= 4      # greater than or equal to
[1] TRUE
> 3 == 3      # equality operator
[1] TRUE
> 3 != 2      # inequality operator
[1] TRUE

Just as there are functions in R that are only defined for work on the numeric and logical data types, there are other functions that are designed to work only with the character data type, also known as strings:

> lang.domain <- "statistics"
> lang.domain <- toupper(lang.domain)
> print(lang.domain)
[1] "STATISTICS"
> # retrieves substring from first character to fourth character
> substr(lang.domain, 1, 4)
[1] "STAT"
> gsub("I", "1", lang.domain)      # substitutes every "I" for "1"
[1] "STAT1ST1CS"
> # combines character strings
> paste("R does", lang.domain, "!!!")
[1] "R does STATISTICS !!!"

Flow of control

The last topic in this section will be flow of control constructs.
The most basic flow of control construct is the if statement. The argument to an if statement (what goes between the parentheses) is an expression that returns a logical value. The block of code following the if statement gets executed only if the expression yields TRUE. For example:

> if (2 + 2 == 4)
+   print("very good")
[1] "very good"
> if (2 + 2 == 5)
+   print("all hail to the thief")
>

It is possible to execute more than one statement if an if condition is triggered; you just have to use curly brackets ({}) to contain the statements.

> if ((4 / 2 == 2) && (2 * 2 == 4)) {
+   print("four divided by two is two…")
+   print("and two times two is four")
+ }
[1] "four divided by two is two…"
[1] "and two times two is four"
>

It is also possible to specify a block of code that will get executed if the if conditional is FALSE.

> closing.time <- TRUE
> if (closing.time) {
+   print("you don't have to go home")
+   print("but you can't stay here")
+ } else {
+   print("you can stay here!")
+ }
[1] "you don't have to go home"
[1] "but you can't stay here"
> if (!closing.time) {
+   print("you don't have to go home")
+   print("but you can't stay here")
+ } else {
+   print("you can stay here!")
+ }
[1] "you can stay here!"
>

There are other flow of control constructs (like while and for), but we won't directly be using them much in this text.

Getting help in R

Before we go further, it would serve us well to have a brief section detailing how to get help in R. Most R tutorials leave this for one of the last sections—if it is even included at all! In my own personal experience, though, getting help is going to be one of the first things you will want to do as you add more bricks to your R knowledge castle. Learning R doesn't have to be difficult; just take it slowly, ask questions, and get help early. Go you!

It is easy to get help with R right at the console. Running the help.start() function at the prompt will start a manual browser. From here, you can do anything from going over the basics of R to reading the nitty-gritty details on how R works internally.
You can get help on a particular function in R if you know its name, by supplying that name as an argument to the help function. For example, let's say you want to know more about the gsub() function that I sprang on you before. Running the following code:

> help("gsub")
> # or simply
> ?gsub

will display a manual page documenting what the function is, how to use it, and examples of its usage.

This rapid accessibility to documentation means that I'm never hopelessly lost when I encounter a function which I haven't seen before. The downside to this extraordinarily convenient help mechanism is that I rarely bother to remember the order of arguments, since looking them up is just seconds away.

Occasionally, you won't quite remember the exact name of the function you're looking for, but you'll have an idea about what the name should be. For this, you can use the help.search() function.

> help.search("chisquare")
> # or simply
> ??chisquare

For tougher, more semantic queries, nothing beats a good old fashioned web search engine. If you don't get relevant results the first time, try adding the term programming or statistics in there for good measure.

Vectors

Vectors are the most basic data structures in R, and they are ubiquitous indeed. In fact, even the single values that we've been working with thus far were actually vectors of length 1. That's why the interactive R console has been printing [1] along with all of our output.

Vectors are essentially an ordered collection of values of the same atomic data type. Vectors can be arbitrarily large (with some limitations), or they can be just one single value.

The canonical way of building vectors manually is by using the c() function (which stands for combine).

> our.vect <- c(8, 6, 7, 5, 3, 0, 9)
> our.vect
[1] 8 6 7 5 3 0 9

In the preceding example, we created a numeric vector of length 7 (namely, Jenny's telephone number).
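Since the single-values-are-vectors point is so foundational, here is a tiny supplementary sketch (not from the chapter itself) confirming it right at the console:

```r
# A single value is just a vector of length 1, which is why the
# console prefixes every output with the [1] index.
length(10)              # even a lone number has a length: 1
is.vector(10)           # and R considers it a vector: TRUE
identical(10, c(10))    # wrapping it in c() changes nothing: TRUE
```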
Note that if we tried to put character data types into this vector as follows:

> another.vect <- c("8", 6, 7, "-", 3, "0", 9)
> another.vect
[1] "8" "6" "7" "-" "3" "0" "9"

R would convert all the items in the vector (called elements) into character data types to satisfy the condition that all elements of a vector must be of the same type. A similar thing happens when you try to use logical values in a vector with numbers; the logical values would be converted into 1 and 0 (for TRUE and FALSE, respectively). These logicals will turn into "TRUE" and "FALSE" (note the quotation marks) when used in a vector that contains characters.

Subsetting

It is very common to want to extract one or more elements from a vector. For this, we use a technique called indexing or subsetting. After the vector, we put an integer in square brackets ([]) called the subscript operator. This instructs R to return the element at that index. The indices (plural for index, in case you were wondering!) for vectors in R start at 1, and stop at the length of the vector.

> our.vect[1]      # to get the first value
[1] 8
> # the function length() returns the length of a vector
> length(our.vect)
[1] 7
> our.vect[length(our.vect)]      # get the last element of a vector
[1] 9

Note that in the preceding code, we used a function in the subscript operator. In cases like these, R evaluates the expression in the subscript operator, and uses the number it returns as the index to extract.

If we get greedy, and try to extract an element at an index that doesn't exist, R will respond with NA, meaning not available. We see this special value cropping up from time to time throughout this text.

> our.vect[10]
[1] NA

One of the most powerful ideas in R is that you can use vectors to subset other vectors:

> # extract the first, third, fifth, and
> # seventh element from our vector
> our.vect[c(1, 3, 5, 7)]
[1] 8 7 3 9

The ability to use vectors to index other vectors may not seem like much now, but its usefulness will become clear soon.

Another way to create vectors is by using sequences.
> other.vector <- 1:10
> other.vector
 [1]  1  2  3  4  5  6  7  8  9 10
> another.vector <- seq(50, 30, by=-2)
> another.vector
 [1] 50 48 46 44 42 40 38 36 34 32 30

Above, the 1:10 statement creates a vector from 1 to 10. 10:1 would have created the same 10-element vector, but in reverse. The seq() function is more general in that it allows sequences to be made using steps (among many other things).

Combining our knowledge of sequences and subsetting vectors, we can get the first 5 digits of Jenny's number thusly:

> our.vect[1:5]
[1] 8 6 7 5 3

Vectorized functions

Part of what makes R so powerful is that many of R's functions take vectors as arguments. These vectorized functions are usually extremely fast and efficient. We've already seen one such function, length(), but there are many many others.

> # takes the mean of a vector
> mean(our.vect)
[1] 5.428571
> sd(our.vect)      # standard deviation
[1] 3.101459
> min(our.vect)
[1] 0
> max(1:10)
[1] 10
> sum(c(1, 2, 3))
[1] 6

In practical settings, such as when reading data from files, it is common to have NA values in vectors:

> messy.vector <- c(8, 6, NA, 7, 5, NA, 3, 0, 9)
> messy.vector
[1]  8  6 NA  7  5 NA  3  0  9
> length(messy.vector)
[1] 9

Some vectorized functions will not allow NA values by default. In these cases, an extra keyword argument must be supplied along with the first argument to the function.

> mean(messy.vector)
[1] NA
> mean(messy.vector, na.rm=TRUE)
[1] 5.428571
> sum(messy.vector, na.rm=FALSE)
[1] NA
> sum(messy.vector, na.rm=TRUE)
[1] 38

As mentioned previously, vectors can be constructed from logical values too.

> log.vector <- c(TRUE, TRUE, FALSE)
> log.vector
[1]  TRUE  TRUE FALSE

Since logical values can be coerced into behaving like numerics, as we saw earlier, if we try to sum a logical vector as follows:

> sum(log.vector)
[1] 2

we will, essentially, get a count of the number of TRUE values in that vector.

There are many functions in R which operate on vectors and return logical vectors. is.na() is one such function. It returns a logical vector—that is, of the same length as the vector supplied as an argument—with a TRUE in the position of every NA value. Remember our messy vector (from just a minute ago)?
> messy.vector
[1]  8  6 NA  7  5 NA  3  0  9
> is.na(messy.vector)
[1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
> #      8     6    NA     7     5    NA     3     0     9

Putting together these pieces of information, we can get a count of the number of NA values in a vector as follows:

> sum(is.na(messy.vector))
[1] 2

When you use Boolean operators on vectors, they also return logical vectors of the same length as the vector being operated on.

> our.vect > 5
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE

If we wanted to—and we do—count the number of digits in Jenny's phone number that are greater than five, we would do so in the following manner:

> sum(our.vect > 5)
[1] 4

Advanced subsetting

Did I mention that we can use vectors to subset other vectors? When we subset vectors using logical vectors of the same length, only the elements corresponding to the TRUE values are extracted. Hopefully, sparks are starting to go off in your head. If we wanted to extract only the legitimate non-NA digits from Jenny's number, we can do it as follows:

> messy.vector[!is.na(messy.vector)]
[1] 8 6 7 5 3 0 9

This is a very critical trait of R, so let's take our time understanding it; this idiom will come up again and again throughout this book.

The logical vector that yields TRUE when an NA value occurs in messy.vector (from is.na()) is then negated (the whole thing) by the negation operator !. The resultant vector is TRUE whenever the corresponding value in messy.vector is not NA. When this logical vector is used to subset the original messy vector, it only extracts the non-NA values from it.

Similarly, we can show all the digits in Jenny's phone number that are greater than five as follows:

> our.vect[our.vect > 5]
[1] 8 6 7 9

Thus far, we've only been displaying elements that have been extracted from a vector. However, just as we've been assigning and re-assigning variables, we can assign values to various indices of a vector, and change the vector as a result. For example, if Jenny tells us that we have the first digit of her phone number wrong (it's really 9), we can reassign just that element without modifying the others.
> our.vect
[1] 8 6 7 5 3 0 9
> our.vect[1] <- 9
> our.vect
[1] 9 6 7 5 3 0 9

Sometimes, it may be required to replace all the NA values in a vector with the value 0. To do that with our messy vector, we can execute the following command:

> messy.vector[is.na(messy.vector)] <- 0
> messy.vector
[1] 8 6 0 7 5 0 3 0 9

Elegant though the preceding solution is, modifying a vector in place is usually discouraged in favor of creating a copy of the original vector and modifying the copy. One technique for performing this is by using the ifelse() function.

Not to be confused with the if/else control construct, ifelse() is a function that takes three arguments: a test that returns a logical/Boolean value, a value to use if the element passes the test, and one to return if the element fails the test.

The preceding in-place modification solution could be re-implemented with ifelse() as follows:

> ifelse(is.na(messy.vector), 0, messy.vector)
[1] 8 6 0 7 5 0 3 0 9
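To see the copy-versus-in-place distinction concretely, here is a small sketch (the variable names messy and clean are just illustrative): ifelse() returns a new vector and leaves its input untouched unless we reassign the result.

```r
# ifelse() returns a new vector; the original is untouched
# unless we explicitly reassign the result.
messy <- c(8, 6, NA, 7, 5, NA, 3, 0, 9)   # a fresh copy of our messy vector
clean <- ifelse(is.na(messy), 0, messy)

clean   # NAs replaced by 0: 8 6 0 7 5 0 3 0 9
messy   # still contains the NAs
```

If we do want the original replaced, we simply assign back: messy <- ifelse(is.na(messy), 0, messy).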
> our.vect + 3
[1] 12  9 10  8  6  3 12

is tantamount to…

> our.vect + c(3, 3, 3, 3, 3, 3, 3)
[1] 12  9 10  8  6  3 12

If we wanted to extract every other digit from Jenny's phone number, we can do so in the following manner:

> our.vect[c(TRUE, FALSE)]
[1] 9 7 3 9

This works because the vector c(TRUE, FALSE) is repeated until it is of length 7, making it equivalent to the following:

> our.vect[c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)]
[1] 9 7 3 9

One common snag related to vector recycling that R users (useRs, if I may) encounter is that during some arithmetic operations involving vectors of discrepant length, R will warn you if the smaller vector cannot be repeated a whole number of times to reach the length of the bigger vector. This is not a problem when doing vector arithmetic with single values, since 1 can be repeated any number of times to match the length of any vector (whose length must, of course, be an integer). It would pose a problem, though, if we were looking to add three to every other element in Jenny's phone number.

> our.vect + c(3, 0)
[1] 12  6 10  5  6  0 12
Warning message:
In our.vect + c(3, 0) :
  longer object length is not a multiple of shorter object length

You will likely learn to love these warnings, as they have stopped many useRs from making grave errors.

Before we move on to the next section, an important thing to note is that in a lot of other programming languages, many of the things we did would have been implemented using for loops and other control structures. Although there is certainly a place for loops and such in R, oftentimes a more sophisticated solution exists using just vector/matrix operations. In addition to elegance and brevity, the solution that exploits vectorization and recycling is often many, many times more efficient.
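As a quick sketch of clean recycling (where no warning is issued because the shorter vector fits a whole number of times):

```r
# Recycling in action: the shorter vector is repeated to match the
# longer one. c(10, 100) recycles cleanly into a length-6 vector.
x <- 1:6
x + c(10, 100)   # 11 102 13 104 15 106
x * c(1, -1)     # flips the sign of every other element: 1 -2 3 -4 5 -6
```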
Functions

If we need to perform some computation that isn't already a function in R multiple times, we usually do so by defining our own functions. A custom function in R is defined using the following syntax:

function.name <- function(argument1, argument2, ...){
  # some functionality
}

For example, if we wanted to write a function that determined if a number supplied as an argument was even, we can do so in the following manner:

> is.even <- function(a.number){
+   remainder <- a.number %% 2
+   if(remainder == 0)
+     return(TRUE)
+   return(FALSE)
+ }
>
> # testing it
> is.even(10)
[1] TRUE
> is.even(9)
[1] FALSE

As an example of a function that takes more than one argument, let's generalize the preceding function by creating a function that determines whether the first argument is divisible by its second argument.

> is.divisible.by <- function(large.number, smaller.number){
+   if(large.number %% smaller.number != 0)
+     return(FALSE)
+   return(TRUE)
+ }
>
> # testing it
> is.divisible.by(10, 2)
[1] TRUE
> is.divisible.by(10, 3)
[1] FALSE
> is.divisible.by(9, 3)
[1] TRUE

Our function, is.even(), could now be rewritten simply as:

> is.even <- function(num){
+   is.divisible.by(num, 2)
+ }

It is very common in R to want to apply a particular function to every element of a vector. Instead of using a loop to iterate over the elements of a vector, as we would do in many other languages, we use a function called sapply() to perform this. sapply() takes a vector and a function as its arguments. It then applies the function to every element and returns a vector of results. We can use sapply() in this manner to find out which digits in Jenny's phone number are even:

> sapply(our.vect, is.even)
[1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE

This worked great because sapply takes each element and uses it as the argument in is.even(), which takes only one argument. If you wanted to find the digits that are divisible by three, it would require a little bit more work.
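Worth noting in passing: because the modulus operator %% is itself vectorized, this particular result can also be had without sapply() at all—a small illustration of the vectorization point made earlier.

```r
# The %% operator is vectorized, so an element-wise evenness check
# needs no explicit apply step at all.
our.vect <- c(9, 6, 7, 5, 3, 0, 9)   # Jenny's (corrected) number
our.vect %% 2 == 0                   # FALSE TRUE FALSE FALSE FALSE TRUE FALSE
```

sapply() earns its keep when the function being applied is not vectorized, as with our if-based is.even().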
One option is just to define a function, is.divisible.by.three(), that takes only one argument, and use that in sapply(). The more common solution, however, is to define an unnamed function that does just that in the body of the sapply() function call:

> sapply(our.vect, function(num){ is.divisible.by(num, 3) })
[1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE

Here, we essentially created a function that checks whether its argument is divisible by three, except we don't assign it to a variable, and use it directly in the sapply body instead. These one-time-use unnamed functions are called anonymous functions or lambda functions. (The name comes from Alonzo Church's invention of the lambda calculus, if you were wondering.)

This is somewhat of an advanced usage of R, but it is very useful as it comes up very often in practice.

If we wanted to extract the digits in Jenny's phone number that are divisible by both two and three, we can write it as follows:

> where.even <- sapply(our.vect, is.even)
> where.div.3 <- sapply(our.vect, function(num){
+   is.divisible.by(num, 3) })
> # "&" is like the "&&" and operator, but for vectors
> our.vect[where.even & where.div.3]
[1] 6 0

Neat-o!

Note that if we wanted to be sticklers, we would have a clause in the function bodies to preclude a modulus computation where the first number was smaller than the second. If we had, our functions would not have erroneously indicated that 0 was divisible by two and three. I'm not a stickler, though, so the functions will remain as is. Fixing this function is left as an exercise for the (stickler) reader.

Matrices

In addition to the vector data structure, R has the matrix, data frame, list, and array data structures. Though we will be using all these types (except arrays) in this book, we only need to review the first two in this chapter.

A matrix in R, like in math, is a rectangular array of values (of one type) arranged in rows and columns, and can be manipulated as a whole. Operations on matrices are fundamental to data analysis.

One way of creating a matrix is to just supply a vector to the function matrix().
> a.matrix <- matrix(c(1, 2, 3, 4, 5, 6))
> a.matrix
     [,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    4
[5,]    5
[6,]    6

This produces a matrix with all the supplied values in a single column. We can make a similar matrix with two columns by supplying matrix() with an optional argument, ncol, that specifies the number of columns.

> a.matrix <- matrix(c(1, 2, 3, 4, 5, 6), ncol=2)
> a.matrix
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

We could have produced the same matrix by binding two vectors, c(1, 2, 3) and c(4, 5, 6), by columns using the cbind() function as follows:

> a2.matrix <- cbind(c(1, 2, 3), c(4, 5, 6))

We could create the transposition of this matrix (where rows and columns are switched) by binding those vectors by row instead:

> a3.matrix <- rbind(c(1, 2, 3), c(4, 5, 6))
> a3.matrix
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

or by just using the matrix transposition function in R, t().

> t(a2.matrix)

Some other functions that operate on whole matrices are rowSums()/colSums() and rowMeans()/colMeans().

> a2.matrix
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> colSums(a2.matrix)
[1]  6 15
> rowMeans(a2.matrix)
[1] 2.5 3.5 4.5

If vectors have sapply(), then matrices have apply(). The preceding two functions could have been written, more verbosely, as:

> apply(a2.matrix, 2, sum)
[1]  6 15
> apply(a2.matrix, 1, mean)
[1] 2.5 3.5 4.5

where 1 instructs R to perform the supplied function over its rows, and 2, over its columns.

The matrix multiplication operator in R is %*%.

> a2.matrix %*% a2.matrix
Error in a2.matrix %*% a2.matrix : non-conformable arguments

Remember, matrix multiplication is only defined for matrices where the number of columns in the first matrix is equal to the number of rows in the second.

> a2.matrix
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> a3.matrix
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
> a2.matrix %*% a3.matrix
     [,1] [,2] [,3]
[1,]   17   22   27
[2,]   22   29   36
[3,]   27   36   45
>
> # dim() tells us how many rows and columns
> # (respectively) there are in the given matrix
> dim(a2.matrix)
[1] 3 2

To index the element of a matrix at the second row and first column, you need to supply both of these numbers into the subscripting operator.

> a2.matrix[2, 1]
[1] 2

Many useRs get confused and forget the order in which the indices must appear; remember—it's rows first, then columns!
If you leave one of the spaces empty, R will assume you want that whole dimension:

> # returns the whole second column
> a2.matrix[, 2]
[1] 4 5 6
> # returns the first row
> a2.matrix[1, ]
[1] 1 4

And, as always, we can use vectors in our subscript operator:

> # give me the elements in column 2 at the first and third rows
> a2.matrix[c(1, 3), 2]
[1] 4 6

Loading data into R

Thus far, we've only been entering data directly into the interactive R console. For any dataset of non-trivial size, this is, obviously, an intractable solution. Fortunately for us, R has a robust suite of functions for reading data directly from external files.

Go ahead, and create a file on your hard disk called favorites.txt that looks like this:

flavor,number
pistachio,6
mint chocolate chip,7
vanilla,5
chocolate,10
strawberry,2
neopolitan,4

This data represents the number of students in a class that prefer a particular flavor of soy ice cream. We can read the file into a variable called favs as follows:

> favs <- read.table("favorites.txt", sep=",", header=TRUE)

If you get an error that there is no such file or directory, give R the full path name to your dataset or, alternatively, run the following command:

> favs <- read.table(file.choose(), sep=",", header=TRUE)

The preceding command brings up an open file dialog for letting you navigate to the file you've just created.

The argument sep="," tells R that each data element in a row is separated by a comma. Other common data formats have values separated by tabs and pipes ("|"). The value of sep should then be "\t" or "|", respectively.

The argument header=TRUE tells R that the first row of the file should be interpreted as the names of the columns. Remember, you can enter ?read.table at the console to learn more about these options.
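If you'd rather not create the file by hand, here is a self-contained sketch that writes the sample data to a temporary file from within R and reads it back:

```r
# Write the sample file from R itself (to a temporary location),
# then read it back in with read.table().
path <- file.path(tempdir(), "favorites.txt")
writeLines(c("flavor,number",
             "pistachio,6",
             "mint chocolate chip,7",
             "vanilla,5",
             "chocolate,10",
             "strawberry,2",
             "neopolitan,4"), path)

favs <- read.table(path, sep=",", header=TRUE)
nrow(favs)    # 6 rows of data (the header is not counted)
names(favs)   # "flavor" "number"
```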
Reading from files in this comma-separated values format (usually with the .csv file extension) is so common that R has a more specific function just for it. The preceding data import expression can be best written simply as:

> favs <- read.csv("favorites.txt")

Now, we have all the data in the file held in a variable of class data.frame. A data frame can be thought of as a rectangular array of data that you might see in a spreadsheet application. In this way, a data frame can also be thought of as a matrix; indeed, we can use matrix-style indexing to extract elements from it. A data frame differs from a matrix, though, in that a data frame may have columns of differing types. For example, whereas a matrix would only allow one of these types, the dataset we just loaded contains character data in its first column, and numeric data in its second column.

Let's check out what we have by using the head() command, which will show us the first few lines of a data frame:

> head(favs)
               flavor number
1           pistachio      6
2 mint chocolate chip      7
3             vanilla      5
4           chocolate     10
5          strawberry      2
6          neopolitan      4
> class(favs)
[1] "data.frame"
> class(favs$flavor)
[1] "factor"
> class(favs$number)
[1] "numeric"

I lied, OK! So what?! Technically, flavor is a factor data type, not a character type.

We haven't seen factors yet, but the idea behind them is really simple. Essentially, factors are codings for categorical variables, which are variables that take on one of a finite number of categories—think {"high", "medium", "low"} or {"control", "experimental"}.

Though factors are extremely useful in statistical modeling in R, the fact that R, by default, automatically interprets a column from data read from disk as type factor if it contains characters is something that trips up novices and seasoned useRs alike. Because of this, we will primarily prevent this behavior manually by adding the stringsAsFactors optional keyword argument to the read.* commands:

> favs <- read.csv("favorites.txt", stringsAsFactors=FALSE)
> class(favs$flavor)
[1] "character"

Much better, for now! If you'd like to make this behavior the new default, read the ?options manual page. We can always convert to factors later on if we need to!

If you haven't noticed already, I've snuck a new operator on you—$, the extract operator.
This is the most popular way to extract attributes (or columns) from a data frame. You can also use double square brackets ([[ and ]]) to do this.

These are both in addition to the canonical matrix indexing option. The following three statements are thus, in this context, functionally identical:

> favs$flavor
[1] "pistachio"           "mint chocolate chip" "vanilla"
[4] "chocolate"           "strawberry"          "neopolitan"
> favs[["flavor"]]
[1] "pistachio"           "mint chocolate chip" "vanilla"
[4] "chocolate"           "strawberry"          "neopolitan"
> favs[, 1]
[1] "pistachio"           "mint chocolate chip" "vanilla"
[4] "chocolate"           "strawberry"          "neopolitan"

Note

Notice how R has now printed another number in square brackets—besides [1]—along with our output. This is to show us that chocolate is the fourth element of the vector that was returned from the extraction.

You can use the names() function to get a list of the columns available in a data frame. You can even reassign names using the same:

> names(favs)
[1] "flavor" "number"
> names(favs)[1] <- "flav"
> names(favs)
[1] "flav"   "number"

Lastly, we can get a compact display of the structure of a data frame by using the str() function on it:

> str(favs)
'data.frame':   6 obs. of  2 variables:
 $ flav  : chr  "pistachio" "mint chocolate chip" "vanilla" "chocolate" ...
 $ number: num  6 7 5 10 2 4

Actually, you can use this function on any R structure—the property of functions that change their behavior based on the type of input is called polymorphism.

Working with packages

Robust, performant, and numerous though base R's functions are, we are by no means limited to them! Additional functionality is available in the form of packages. In fact, what makes R such a formidable statistics platform is the astonishing wealth of packages available (well over 7,000 at the time of writing). R's ecosystem is second to none!

Most of these myriad packages exist on the Comprehensive R Archive Network (CRAN). CRAN is the primary repository for user-created packages.

One package that we are going to start using right away is the ggplot2 package. ggplot2 is a plotting system for R. Base R has sophisticated and advanced mechanisms to plot data, but many find ggplot2 more consistent and easier to use. Further, the plots are often more aesthetically pleasing by default.

Let's install it!
> # downloads and installs from CRAN
> install.packages("ggplot2")

Now that we have the package downloaded, let's load it into the R session, and test it out by plotting our data from the last section:

> library(ggplot2)
> ggplot(favs, aes(x=flav, y=number)) +
+   geom_bar(stat="identity") +
+   ggtitle("Soy ice cream flavor preferences")

Figure 1.1: Soy ice cream flavor preferences

You're all wrong; mint chocolate chip is way better!

Don't worry about the syntax of the ggplot function, yet. We'll get to it in good time.

You will be installing some more packages as you work through this text. In the meantime, if you want to play around with a few more packages, you can install the gdata and foreign packages that allow you to directly import Excel spreadsheets and SPSS data files, respectively, into R.

Exercises

You can practice the following exercises to help you get a good grasp of the concepts learned in this chapter:

- Write a function called simon.says that takes in a character string, and returns that string in all uppercase after prepending the string "Simon says: " to the beginning of it.
- Write a function that takes two matrices as arguments, and returns a logical value representing whether the matrices can be matrix multiplied.
- Find a free dataset on the web, download it, and load it into R. Explore the structure of the dataset.
- Reflect upon how Hester Prynne allowed her scarlet letter to be decorated with flowers by her daughter in Chapter 10. To what extent is this indicative of Hester's recasting of the scarlet letter as a positive part of her identity? Back up your thesis with excerpts from the book.

Summary

In this chapter, we learned about the world's greatest analytics platform, R. We started from the beginning and built a foundation, and will now explore R further, based on the knowledge gained in this chapter. By now, you have become well versed in the basics of R (which, paradoxically, is the hardest part). You now know how to:

- Use R as a big calculator to do arithmetic
- Make vectors, operate on them, and subset them expressively
- Load data from disk
- Install packages

You have by no means finished learning about R; indeed, we have gone over mostly just the basics. However, we have enough to continue ahead, and you'll pick up more along the way. Onward to statistics land!
Chapter 2. The Shape of Data

Welcome back! Since we now have enough knowledge about R under our belt, we can finally move on to applying it. So, join me as we jump out of the R frying pan and into the statistics fire.

Univariate data

In this chapter, we are going to deal with univariate data, which is a fancy way of saying samples of one variable—the kind of data that goes into a single R vector. Analysis of univariate data isn't concerned with the why questions—causes, relationships, or anything like that; the purpose of univariate analysis is simply to describe.

In univariate data, one variable—let's call it x—can represent categories like soy ice cream flavors, heads or tails, names of cute classmates, the roll of a die, and so on. In cases like these, we call x a categorical variable.

> categorical.data <- c("heads", "tails", "tails", "heads")

Categorical data is represented, in the preceding statement, as a vector of character type. In this particular example, we could further specify that this is a binary or dichotomous variable, because it only takes on two values, namely, "heads" and "tails."

Our variable x could also represent a number like air temperature, the prices of financial instruments, and so on. In such cases, we call this a continuous variable.

> contin.data <- c(198.41, 178.46, 165.20, 141.71, 138.77)

Univariate data of a continuous variable is represented, as seen in the preceding statement, as a vector of numeric type. These data are the stock prices of a hypothetical company that offers a hypothetical commercial statistics platform inferior to R.

You might come to the conclusion that if a vector contains character types, it is a categorical variable, and if it contains numeric types, it is a continuous variable. Not quite! Consider the case of data that contains the results of rolls of a six-sided die. A natural approach to storing this would be by using a numeric vector. However, this isn't a continuous variable, because each result can only take on six distinct values: 1, 2, 3, 4, 5, and 6. This is a discrete numeric variable. Other discrete numeric variables can be the number of bacteria in a petri dish, or the number of love letters to cute classmates.
The mark of a continuous variable is that it could take on any value between some theoretical minimum and maximum. The range of values in the case of a die roll has a minimum of 1 and a maximum of 6, but a roll can never be 2.3. Contrast this with, say, the example of the stock prices, which could be zero, zillions, or anything in between.

On occasion, we are unable to neatly classify non-categorical data as either continuous or discrete. In some cases, discrete variables may be treated as if there is an underlying continuum. Additionally, continuous variables can be discretized, as we'll see soon.

Frequency distributions

A common way of describing univariate data is with a frequency distribution. We've already seen an example of a frequency distribution when we looked at the preferences for soy ice cream at the end of the last chapter. For each flavor of ice cream (a categorical variable), it depicted the count, or frequency, of the occurrences in the underlying dataset.

To demonstrate examples of other frequency distributions, we need to find some data. Fortunately, for the convenience of useRs everywhere, R comes preloaded with almost one hundred datasets. You can view a full list if you execute help(package="datasets"). There are also hundreds more available from add-on packages.

The first dataset that we are going to use is mtcars—data on the design and performance of 32 automobiles, extracted from the 1974 Motor Trend US magazine. (To find out more information about this dataset, execute ?mtcars.)

Take a look at the first few lines of this dataset using the head function:

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Check out the carb column, which represents the number of carburetors; by now you should recognize this as a discrete numeric variable, though we can (and will!) treat it as a categorical variable for now.

Running the carb vector through the unique function yields the distinct values that this vector contains.
> unique(mtcars$carb)
[1] 4 1 2 3 6 8

We can see that there must be repeats in the carb vector, but how many? An easy way of performing a frequency tabulation in R is to use the table function:

> table(mtcars$carb)

 1  2  3  4  6  8
 7 10  3 10  1  1

From the result of the preceding function, we can tell that there are 10 cars with 2 carburetors and 10 with 4, and there is one car each with 6 and 8 carburetors. The value with the most occurrences in a dataset (in this example, the carb column is our whole dataset) is called the mode. In this case, there are two such values, 2 and 4, so this dataset is bimodal. (There is a package in R, called modeest, for finding modes easily.)

Frequency distributions are more often depicted as a chart or plot than as a table of numbers. When the univariate data is categorical, it is commonly represented as a bar chart, as shown in Figure 2.1.

The other dataset that we are going to use, to demonstrate a frequency distribution of a continuous variable, is the airquality dataset, which holds daily air quality measurements from May to September in New York. Take a look at it using the head and str functions. The univariate data that we will be using is the Temp column, which contains the temperature data in degrees Fahrenheit.

Figure 2.1: Frequency distribution of number of carburetors in the mtcars dataset

It would be useless to take the same approach to frequency tabulation as we did in the case of the car carburetors. If we did so, we would have a table containing the frequencies for each of the 40 unique temperatures—and there would be far more if the temperatures weren't rounded to the nearest degree. Additionally, who cares that there was one occurrence of 63 degrees and two occurrences of 64? I sure don't! What we do care about is the approximate temperature.

Our first step towards building a frequency distribution of the temperature data is to bin the data—which is to say, we divide the range of values of the vector into a series of smaller intervals. This binning is a method of discretizing a continuous variable. We then count the number of values that fall into each interval.
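Short of installing modeest, a common base-R idiom for finding the mode combines table() with which.max()—with the caveat, shown below, that which.max() reports only the first maximum and so misses the second mode of a bimodal variable like carb:

```r
# Base R has no built-in statistical-mode function; table() plus
# which.max() is a common workaround.
counts <- table(mtcars$carb)
names(which.max(counts))               # "2" -- only the FIRST mode

# To get every mode, keep all values tied for the highest count:
names(counts)[counts == max(counts)]   # "2" "4" -- both modes
```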
Choosing the size of the bins to use is tricky. If there are too many bins, we run into the same problem as we did with the raw data and have an unwieldy number of columns in our frequency tabulation. If we make too few, however, we lose resolution and may lose important information. Choosing the right number of bins is more art than science, but there are certain commonly used heuristics that often produce sensible results.

We can have R construct n equally-spaced bins for us by using the cut function which, in its simplest use case, takes a vector of data and the number of bins to create:

> cut(airquality$Temp, 9)

We can then feed this result into the table function for a far more manageable frequency tabulation:

> table(cut(airquality$Temp, 9))

  (56,60.6] (60.6,65.1] (65.1,69.7] (69.7,74.2] (74.2,78.8]
          8          10          14          16          26
(78.8,83.3] (83.3,87.9] (87.9,92.4]   (92.4,97]
         35          22          15           7

Rad!

Remember when we used a bar chart to visualize the frequency distributions of categorical data? The common method for visualizing the distribution of discretized continuous data is by using a histogram, as seen in Figure 2.2.

Figure 2.2: Daily temperature measurements from May to September in NYC

Central tendency

One very popular question to ask about univariate data is What is the typical value? or What's the value around which the data are centered? To answer these questions, we have to measure the central tendency of a set of data.

We've seen one measure of central tendency already: the mode. The mtcars$carb data subset was bimodal, with a two- and four-carburetor setup being the most popular. The mode is the central tendency measure that is applicable to categorical data.

The mode of a discretized continuous distribution is usually considered to be the interval that contains the highest frequency of data points. This makes it dependent on the method and parameters of the binning. Finding the mode of data from a non-discretized continuous distribution is a more complicated procedure, which we'll see later.

Perhaps the most famous and commonly used measure of central tendency is the mean. The mean is the sum of a set of numerics divided by the number of elements in that set.
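The bin-then-count workflow above can be sketched as a pipeline, and it's worth knowing that base R's hist() both bins and draws in one step (the commented line below; the main= title is just illustrative):

```r
# The binning-and-counting workflow on the same data.
temps  <- airquality$Temp
binned <- cut(temps, 9)        # a factor of 9 equal-width intervals
tab    <- table(binned)        # frequency of each interval

length(levels(binned))         # 9 bins, as requested
sum(tab) == length(temps)      # every observation lands in some bin

# hist() bins the data and draws the histogram in a single step:
# hist(temps, breaks=9, main="Daily temperatures (NYC, May-Sep)")
```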
This simple concept can also be expressed as an equation:

\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}

where \bar{x} (pronounced x bar) is the mean, \sum x_i is the summation of the elements in the dataset, and n is the number of elements in the set. (As an aside, if you are intimidated by the equations in this book, don't be! None of them are beyond your grasp—just think of them as sentences of a language you're not proficient in yet.)

The mean is represented as \bar{x} when we are talking about the mean of a sample (or subset) of a larger population, and µ when we are talking about the mean of the population. A population may have too many items to compute the mean directly. When this is the case, we rely on statistics applied to a sample of the population to estimate its parameters.

Another way to express the preceding equation, using R constructs, is as follows:

> sum(nums) / length(nums)   # nums would be a vector of numerics

As you might imagine, though, the mean has an eponymous R function that is built in already:

> mean(c(1, 2, 3, 4, 5))
[1] 3

The mean is not defined for categorical data; remember that the mode is the only measure of central tendency that we can use with categorical data.

The mean—occasionally referred to as the arithmetic mean to contrast it with the far less often used geometric, harmonic, and trimmed means—while extraordinarily popular, is not a very robust statistic. This is because the statistic is unduly affected by outliers (atypically distant data points or observations). A paradigmatic example where the robustness of the mean fails is its application to different distributions of income.

Imagine the wages of employees in a company called Marx & Engels, Attorneys at Law, where the typical worker makes $40,000 a year while the CEO makes $500,000 a year. If we compute the mean of the salaries based on a sample of ten that contains just the exploited class, we will have a fairly accurate representation of the average salary of a worker at that company. If, however, by the luck of the draw, our sample contains the CEO, the mean of the salaries will skyrocket to a value that is no longer representative or very informative.
More specifically, robust statistics are statistical measures that work well when thrown at a wide variety of different distributions. The mean works well with one particular type of distribution, the normal distribution, and, to varying degrees, fails to accurately represent the central tendency of other distributions.

Figure 2.3: A normal distribution

The normal distribution (also called the Gaussian distribution, if you want to impress people) is frequently referred to as the bell curve because of its shape. As seen in the preceding image, the vast majority of the data points lie within a narrow band around the center of the distribution—which is the mean. As you get further and further from the mean, the observations become less and less frequent. It is a symmetric distribution, meaning that the side to the right of the mean is a mirror image of the side to the left of the mean.

Not only is the usage of the normal distribution extremely common in statistics, but it is also ubiquitous in real life, where it can model anything from people's heights to test scores; a few will fare lower than average, and a few fare higher than average, but most are around average.

The utility of the mean as a measure of central tendency becomes strained as the distribution becomes more and more skewed, or asymmetrical.

If the majority of the data points fall on the left side of the distribution, with the right side tapering off slower than the left, the distribution is considered positively skewed or right-tailed. If the longer tail is on the left side and the bulk of the distribution is hanging out to the right, it is called negatively skewed or left-tailed. This can be seen clearly in the following images:

Figure 2.4a: A negatively skewed distribution

Figure 2.4b: A positively skewed distribution

Luckily, for cases of skewed distributions, or other distributions that the mean is inadequate to describe, we can use the median instead.

The median of a dataset is the middle number in the set after it is sorted. Less concretely, it is the value that cleanly separates the higher-valued half of the data from the lower-valued half.

The median of the set of numbers {1, 3, 5, 6, 7} is 5. In a set of numbers with an even number of elements, the mean of the two middle values is taken to be the median.
For example, the median of the set {3, 3, 6, 7, 7, 10} is 6.5. The median is the 50th percentile, meaning that 50 percent of the observations fall below that value.

> median(c(3, 7, 6, 10, 3, 7))
[1] 6.5

Consider the example of Marx & Engels, Attorneys at Law that we referred to earlier. Remember that if the sample of employees' salaries included the CEO, it would give our mean a non-representative value. The median solves our problem beautifully. Let's say our sample of 10 employees' salaries was {41000, 40300, 38000, 500000, 41500, 37000, 39600, 42000, 39900, 39500}. Given this set, the mean salary is $85,880, but the median is $40,100—way more in line with the salary expectations of the proletariat at the law firm.

In symmetric data, the mean and median are often very close to each other in value, if not identical. In asymmetric data, this is not the case. It is telling when the median and the mean are very discrepant. In general, if the median is less than the mean, the dataset has a large right tail or outliers/anomalies/erroneous data to the right of the distribution. If the mean is less than the median, it tells the opposite story. The degree of difference between the mean and the median is often an indication of the degree of skewness.

This property of the median—resistance to the influence of outliers—makes it a robust statistic. In fact, the median is the most outlier-resistant metric in statistics.

As great as the median is, it's far from being perfect for describing data just by itself. To see what I mean, check out the three distributions in Figure 2.5. All three have the same mean and median, yet all three are very different distributions. Clearly, we need to look to other statistical measures to describe these differences.

Note

Before going on to the next chapter, check out the summary function in R.

Figure 2.5: Three distributions with the same mean and median

Spread

Another very popular question regarding univariate data is, How variable are the data points? or How spread out or dispersed are the observations? To answer these questions, we have to measure the spread, or dispersion, of a data sample.
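The law-firm figures quoted above are easy to verify in R:

```r
# The law-firm salary example: the mean is dragged up by the CEO's
# salary, while the median stays representative of a typical worker.
salaries <- c(41000, 40300, 38000, 500000, 41500,
              37000, 39600, 42000, 39900, 39500)

mean(salaries)     # 85880
median(salaries)   # 40100
```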
The simplest way to answer that question is to take the largest value in the dataset and subtract the smallest value from it. This will give you the range. However, this suffers from a problem similar to the issue of the mean. The range of salaries at the law firm will vary widely depending on whether the CEO is included in the set. Further, the range depends on just two values, the highest and the lowest, and therefore can't speak to the dispersion of the bulk of the dataset.

One tactic that solves the first of these problems is to use the interquartile range.

Note

What about measures of spread for categorical data?

The measures of spread that we talk about in this section are only applicable to numeric data. There are, however, measures of spread or diversity of categorical data. In spite of the usefulness of these measures, this topic goes unmentioned or blithely ignored in most data analysis and statistics texts. This is a long and venerable tradition that we will, for the most part, adhere to in this book. If you are interested in learning more about this, search for 'Diversity Indices' on the web.

Remember when we said that the median splits a sorted dataset into two equal parts, and that it was the 50th percentile because 50 percent of the observations fell below its value? In a similar way, if you were to divide a sorted dataset into four equal parts, or quartiles, the three values that make these divides would be the first, second, and third quartiles, respectively. These values can also be called the 25th, 50th, and 75th percentiles. Note that the second quartile, the 50th percentile, and the median are all equivalent.

The interquartile range is the difference between the third and first quartiles. If you apply the interquartile range to a sample of salaries at the law firm that includes the CEO, the enormous salary will be discarded along with the highest 25 percent of the data. However, this still only uses two values, and doesn't speak to the variability of the middle 50 percent.
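Both measures can be computed directly in base R—range via max() and min(), and the interquartile range via the built-in IQR() and quantile() functions:

```r
# Range and interquartile range on the salary sample.
salaries <- c(41000, 40300, 38000, 500000, 41500,
              37000, 39600, 42000, 39900, 39500)

max(salaries) - min(salaries)            # the range: 463000, distorted by the CEO
IQR(salaries)                            # difference between 3rd and 1st quartiles
quantile(salaries, c(0.25, 0.5, 0.75))   # the three quartiles themselves
```

Note how much smaller the IQR is than the range here: the CEO's salary falls in the discarded top quarter.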
Well, one way we can use all the data points to inform our spread metric is by subtracting each element of a dataset from the mean of the dataset. This gives us the deviations, or residuals, from the mean. If we add up all these deviations, we will arrive at the sum of the deviations from the mean. Try to find the sum of the deviations from the mean in this set: {1, 3, 5, 6, 7}.

If we try to compute this, we notice that the positive deviations are cancelled out by the negative deviations. In order to cope with this, we need to take the absolute value, or the magnitude, of each deviation, and sum them.

This is a great start, but note that this metric keeps increasing if we add more data to the set. Because of this, we may want to take the average of these deviations. This is called the average deviation.

For those having trouble following the description in words, the formula for the average deviation from the mean is the following:

\frac{1}{N} \sum_{i=1}^{N} |x_i - \mu|

where µ is the mean, N is the number of elements of the sample, and x_i is the ith element of the dataset. It can also be expressed in R as follows:

> sum(abs(x - mean(x))) / length(x)

Though average deviation is an excellent measure of spread in its own right, its use is commonly—and sometimes unfortunately—supplanted by two other measures.

Instead of taking the absolute value of each residual, we can achieve a similar outcome by squaring each deviation from the mean. This, too, ensures that each residual is positive (so that there is no cancelling out). Additionally, squaring the residuals has the sometimes desirable property of magnifying larger deviations from the mean, while being more forgiving of smaller deviations. The sum of the squared deviations is called (you guessed it!) the sum of squared deviations from the mean or, simply, the sum of squares. The average of the sum of squared deviations from the mean is known as the variance and is denoted by \sigma^2:

\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2

When we square each deviation, we also square our units. For example, if our dataset held measurements in meters, our variance would be expressed in terms of meters squared. To get back our original units, we have to take the square root of the variance:

\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}

This new measure, denoted by σ, is the standard deviation, and it is one of the most important measures in this book.

Note that we switched from referring to the mean as \bar{x} to referring to it as µ. This was not a mistake.
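These definitions translate directly into R on the set from the text, treating it as a complete population (so we divide by N):

```r
# Spread measures, computed by hand, on {1, 3, 5, 6, 7}.
x <- c(1, 3, 5, 6, 7)
m <- mean(x)                             # 4.4

sum(x - m)                               # the raw deviations cancel to 0
avg.dev <- sum(abs(x - m)) / length(x)   # average deviation: 1.92
pop.var <- sum((x - m)^2) / length(x)    # population variance: 4.64
pop.sd  <- sqrt(pop.var)                 # population standard deviation

# Caution: R's built-in var() and sd() use the n-1 (sample) divisor,
# so they will NOT match pop.var and pop.sd exactly.
```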
Remember that x̄ was the sample mean, and µ represented the population mean. The preceding equations use µ to illustrate that these equations are computing the spread metrics on the population dataset, and not on a sample. If we want to describe the variance and standard deviation of a sample, we use the symbols s² and s instead of σ² and σ respectively, and our equations change slightly:

s² = sum((x_i - x̄)²) / (n - 1)

s = √( sum((x_i - x̄)²) / (n - 1) )

Instead of dividing our sum of squares by the number of elements in the set, we are now dividing it by n - 1. What gives?

To answer that question, we have to learn a little bit about populations, samples, and estimation.

Populations, samples, and estimation

One of the core ideas of statistics is that we can use a subset of a group, study it, and then make inferences or conclusions about that much larger group.

For example, let's say we wanted to find the average (mean) weight of all the people in Germany. One way to do this is to visit all the 81 million people in Germany, record their weights, and then find the average. However, it is a far more sane endeavor to take down the weights of only a few hundred Germans, and use those to deduce the average weight of all Germans. In this case, the few hundred people we do measure is the sample, and the entirety of people in Germany is called the population.

Now, there are Germans of all shapes and sizes: some heavier, some lighter. If we only pick a few Germans to weigh, we run the risk of, by chance, choosing a group of primarily underweight Germans or overweight ones. We might then come to an inaccurate conclusion about the weight of all Germans. But, as we add more Germans to our sample, those chance variations tend to balance themselves out.

All things being equal, it would be preferable to measure the weights of all Germans so that we can be absolutely sure that we have the right answer, but that just isn't feasible. If we take a large enough sample, though, and are careful that our sample is well representative of the population, not only can we get extraordinarily close to the actual average weight of the population, but we can quantify our uncertainty. The more Germans we include in our sample, the less uncertain we are about our estimate of the population.
In the preceding case, we are using the sample mean as an estimator of the population mean, and the actual value of the sample mean is called our estimate. It turns out that the formula for the population mean is a great estimator of the mean of the population when applied to only a sample. This is why we make no distinction between the population and sample means, except to replace the µ with x̄. Unfortunately, there exists no perfect estimator for the standard deviation of a population for all population types. There will always be some systematic difference between the expected value of the estimator and the real value of the population. This means that there is some bias in the estimator. Fortunately, we can partially correct it.

Note that the two differences between the population and the sample standard deviation are that (a) the µ is replaced by x̄ in the sample standard deviation, and (b) the divisor n is replaced by n - 1.

In the case of the standard deviation of the population, we know the mean µ. In the case of the sample, however, we don't know the population mean; we only have an estimate of the population mean based on the sample mean x̄. This must be taken into account and corrected for in the new equation. No longer can we divide by the number of elements in the dataset; we have to divide by the degrees of freedom, which is n - 1.

Note

What in the world are degrees of freedom? And why is it n - 1?

Let's say we were gathering a party of six to play a board game. In this board game, each player controls one of six colored pawns. People start to join in at the board. The first person at the board gets their pick of their favorite colored pawn. The second player has one less pawn to choose from, but she still has a choice in the matter. By the time the last person joins in at the game table, she doesn't have a choice in what pawn she uses; she is forced to use the last remaining pawn. The concept of degrees of freedom is a little like this.

If we have a group of five numbers, but hold the mean of those numbers fixed, all but the last number can vary, because the last number must take on the value that will satisfy the fixed mean. We only have four degrees of freedom in this case.
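The bias that the n - 1 divisor corrects can be seen in a quick simulation; the population here is randomly generated, and the sizes are arbitrary, chosen only for illustration:

```r
set.seed(1)
population <- rnorm(100000, mean = 0, sd = 10)  # true variance is roughly 100

# draw many small samples and compute the variance both ways
biased   <- replicate(10000, { s <- sample(population, 5)
                               sum((s - mean(s))^2) / 5 })
unbiased <- replicate(10000, { s <- sample(population, 5)
                               sum((s - mean(s))^2) / 4 })

mean(biased)    # systematically below the true variance
mean(unbiased)  # very close to the true variance
```

Dividing by n underestimates the spread on average, because the sample mean is always closer to its own sample than the population mean is; dividing by the degrees of freedom compensates for this.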
More generally, the degrees of freedom is the sample size minus the number of parameters estimated from the data. When we use the mean estimate in the standard deviation formula, we are effectively keeping one of the parameters of the formula fixed, so that only n - 1 observations are free to vary. This is why the divisor of the sample standard deviation formula is n - 1; it is the degrees of freedom that we are dividing by, not the sample size.

If you thought that the last few paragraphs were heady and theoretical, you're right. If you are confused, particularly by the concept of degrees of freedom, you can take solace in the fact that you are not alone; degrees of freedom, bias, and the subtleties of population versus sample standard deviations are notoriously confusing topics for newcomers to statistics. But you only have to learn it once!

Probability distributions

Up until this point, when we spoke of distributions, we were referring to frequency distributions. However, when we talk about distributions later in the book, or when other data analysts refer to them, we will be talking about probability distributions, which are much more general.

It's easy to turn a categorical, discrete, or discretized frequency distribution into a probability distribution. As an example, refer to the frequency distribution of carburetors in the first image in this chapter. Instead of asking What number of cars have n number of carburetors?, we can ask What is the probability that, if I choose a car at random, I will get a car with n carburetors?

We will talk more about probability (and different interpretations of probability) in Chapter 4, Probability, but for now, probability is a value between 0 and 1 (or 0 percent and 100 percent) that measures how likely an event is to occur. To answer the question What's the probability that I will pick a car with 4 carburetors?, the equation is:

P(4 carburetors) = (number of cars with 4 carburetors) / (total number of cars)

You can find the probability of picking a car with any one particular number of carburetors as follows:

> table(mtcars$carb) / length(mtcars$carb)

      1       2       3       4       6       8 
0.21875 0.31250 0.09375 0.31250 0.03125 0.03125 

Instead of making a bar chart of the frequencies, we can make a bar chart of the probabilities.
This is called a probability mass function (PMF). It looks the same, but now it maps from carburetors to probabilities, not frequencies. Figure 2.6a represents this.

And, just as it is with the bar chart, we can easily tell that 2 and 4 are the numbers of carburetors most likely to be chosen at random.

We could do the same with discretized numeric variables as well. The following images are a representation of the temperature histogram as a probability mass function.

Figure 2.6a: Probability mass function of the number of carburetors

Figure 2.6b: Probability mass function of daily temperature measurements from May to September in NY

Note that this PMF only describes the temperatures of NYC in the data we have.

There's a problem here, though: this PMF is completely dependent on the size of the bins (our method of discretizing the temperatures). Imagine that we constructed the bins such that each bin held only one temperature within a degree. In this case, we wouldn't be able to tell very much from the PMF at all, since each specific degree only occurs a few times, if any, in the dataset. The same problem, but worse, happens when we try to describe continuous variables with probabilities without discretizing them at all. Imagine trying to visualize the probability (or the frequency) of the temperatures if they were measured to the thousandth place (for example, {90.167, 67.361, ...}). There would be no visible bars at all!

What we need here is a probability density function (PDF). A probability density function will tell us the relative likelihood that we will experience a certain temperature. The next image shows a PDF that fits the temperature data that we've been playing with; it is analogous to, but better than, the histogram we saw in the beginning of the chapter and the PMF in the preceding figure.

The first thing you'll notice about this new plot is that it is smooth, not jagged or boxy like the histogram and PMFs. This should intuitively make more sense, because temperature is a continuous variable, and there are likely to be no sharp cutoffs in the probability of experiencing temperatures from one degree to the next.
Figure 2.7: Three distributions with the same mean and median

The second thing you should notice is that the units and the values on the y axis have changed. The y axis no longer represents probabilities; it now represents probability densities. Though it may be tempting, you can't look at this function and answer the question What is the probability that it will be exactly 80 degrees?. Technically, the probability of it being 80.0000 exactly is microscopically small, almost zero. But that's okay! Remember, we don't care what the probability of experiencing a temperature of 80.0000 is; we just care about the probability of a temperature around there.

We can answer the question What's the probability that the temperature will fall within a particular range?. The probability of experiencing a temperature in a range, say 80 to 90 degrees, is the area under the curve from 80 to 90. Those of you unfortunate readers who know calculus will recognize this as the definite integral of the PDF evaluated over that range:

P(80 ≤ x ≤ 90) = ∫ f(x) dx, evaluated from 80 to 90

where f(x) is the probability density function.

The next image shows the area under the curve for this range in pink. You can immediately see that the region covers a lot of area, perhaps one third. According to R, it's about 34 percent.

> temp.density <- density(airquality$Temp)
> pdf <- approxfun(temp.density$x, temp.density$y, rule = 2)
> integrate(pdf, 80, 90)
0.3422287 with absolute error < 7.5e-06

Figure 2.8: PDF with highlighted interval

We don't get a probability density function from the sample for free. The PDF has to be estimated. The PDF isn't so much trying to convey information about the sample we have as attempting to model the underlying distribution that gave rise to that sample.

To do this, we use a method called kernel density estimation. The specifics of kernel density estimation are beyond the scope of this book, but you should know that the density estimation is heavily governed by a parameter that controls the smoothness of the estimation. This is called the bandwidth.
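As a sanity check, the chapter's estimated PDF integrates to (approximately) 1 over its whole support, as any probability density must:

```r
# rebuild the estimated PDF exactly as in the chapter
temp.density <- density(airquality$Temp)
pdf <- approxfun(temp.density$x, temp.density$y, rule = 2)

# the total area under a density curve should be about 1
integrate(pdf, min(temp.density$x), max(temp.density$x))
```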
How do we choose the bandwidth? Well, it's just like choosing the size to make the bins in a histogram: there's no right answer. It's a balancing act between reducing the effect of chance or noise in the model and not losing important information by smoothing over pertinent characteristics of the data. This is a tradeoff we will see time and time again throughout this text.

Anyway, the great thing about PDFs is that you don't have to know calculus to interpret them. Not only are PDFs a useful tool analytically, but they make for a top-notch visualization of the shape of data.

Note

By the way...

Remember when we were talking about modes, and I said that finding the mode of non-discretized, continuously distributed data is a more complicated procedure than for discretized or categorical data? The mode for these types of univariate data is the peak of the PDF. So, in the temperature example, the mode is around 80 degrees.

Figure 2.9: Three different bandwidths used on the same data

Visualization methods

In an earlier image, we saw three very different distributions, all with the same mean and median. I said then that we needed to quantify variance to tell them apart. In the following image, there are three very different distributions, all with the same mean, median, and variance.

Figure 2.10: Three PDFs with the same mean, median, and standard deviation

If you just rely on basic summary statistics to understand univariate data, you'll never get the full picture. It's only when we visualize the data that we can clearly see, at a glance, whether there are any clusters or areas with a high density of data points, the number of clusters there are, whether there are outliers, whether there is a pattern to the outliers, and so on. When dealing with univariate data, the shape is the most important part (that's why this chapter is called The Shape of Data!).

We will be using ggplot2's qplot function to investigate these shapes and visualize these data. qplot (for quick plot) is the simpler cousin of the more expressive ggplot function. qplot makes it easy to produce handsome and compelling graphics using a consistent grammar. Additionally, many of the skills, lessons, and know-how from qplot are transferable to ggplot (for when we have to get more advanced).

Note

What's ggplot2? Why are we using it?
There are a few plotting mechanisms for R, including the default one that comes with R (called base R). However, ggplot2 seems to be a lot of people's favorite. This is not unwarranted, given its wide use, excellent documentation, and consistent grammar.

Since the base R graphics subsystem is what I learned to wield first, I've become adept at using it. There are certain types of plots that I produce faster using base R, so I still use it on a regular basis (Figure 2.8 to Figure 2.10 were made using base R!).

Though we will be using ggplot2 for this book, feel free to go your own way when making your very own plots.

Most of the graphics in this section are going to take the following form:

> qplot(column, data=dataframe, geom=...)

where column is a particular column of the dataframe dataframe, and the geom keyword argument specifies a geometric object; it will control the type of plot that we want. For visualizing univariate data, we don't have many options for geom. The three types that we will be using are bar, histogram, and density. Making a bar graph of the frequency distribution of the number of carburetors couldn't be easier:

> library(ggplot2)
> qplot(factor(carb), data=mtcars, geom="bar")

Figure 2.11: Frequency distribution of the number of carburetors

Using the factor function on the carb column makes the plot look better in this case. We could, if we wanted to, make an unattractive and distracting plot by coloring all the bars a different color, as follows:

> qplot(factor(carb),
+       data=mtcars,
+       geom="bar",
+       fill=factor(carb),
+       xlab="number of carburetors")

Figure 2.12: With color and label modification

We also relabeled the x axis (which is automatically set by qplot) with more informative text.
It’sjustaseasytomakeahistogramofthetemperaturedata—themaindifferenceisthat weswitchgeomfrombartohistogram: >qplot(Temp,data=airquality,geom="histogram") Figure2.13:Histogramoftemperaturedata Whydoesn’titlooklikethefirsthistograminthebeginningofthechapter,youask?Well, that’sbecauseoftworeasons: Iadjustedthebinwidth(sizeofthebins) Iaddedcolortotheoutlineofthebars ThecodeIusedforthefirsthistogramlookedasfollows: >qplot(Temp,data=airquality,geom="histogram", +binwidth=5,color=I("white")) MakingplotsoftheapproximationofthePDFaresimilarlysimple: >qplot(Temp,data=airquality,geom="density") Figure2.14:PDFoftemperaturedata Byitself,Ithinktheprecedingplotisratherunattractive.Wecangiveitalittlemoreflair by: Fillingthecurvepink Addingalittletransparencytothefill >qplot(Temp,data=airquality,geom="density", +adjust=.5,#changesbandwidth +fill=I("pink"), +alpha=I(.5),#addstransparency +main="densityplotoftemperaturedata") Figure2.15:Figure2.14withmodifications Nowthat’sahandsomeplot! Noticethatwealsomadethebandwidthsmallerthanthedefault(1,whichmadethePDF moresquiggly)andaddedatitletotheplotwiththemainfunction. Exercises Hereareafewexercisesforyoutorevisetheconceptslearnedinthischapter: WriteanRfunctiontocomputetheinterquartilerange. Learnaboutwindorized,geometric,harmonic,andtrimmedmeans.Towhatextent dothesemetricssolvetheproblemofthenon-robustnessofthearithmeticmean? CraftanassessmentofVirginiaWoolf’simpactonfemininediscourseinthe20th century.Besuretoaddressbothprosaicandlyricalformsinyourresponse. Summary Oneofthehardestthingsaboutdataanalysisisstatistics,andoneofthehardestthings aboutstatistics(notunlikecomputerprogramming)isthatthebeginningisthetoughest hurdle,becausetheconceptsaresonewandunfamiliar.Asaresult,somemightfindthis tobeoneofthemorechallengingchaptersinthistext. However,hardworkduringthisphasepaysenormousdividends;itprovidesasturdy foundationonwhichtopileonandorganizenewknowledge. 
To recap, in this chapter, we learned about univariate data. We also learned about:

- The types of univariate data
- How to measure the central tendency of these data
- How to measure the spread of these data
- How to visualize the shape of these data

Along the way, we also learned a little bit about probability distributions and population/sample statistics.

I'm glad you made it through! Relax, make yourself a mocktail, and I'll see you shortly in Chapter 3, Describing Relationships!

Chapter 3. Describing Relationships

Is there a relationship between smoking and lung cancer? Do people who care for dogs live longer? Is your university's admissions department sexist?

Tackling these exciting questions is only possible when we take a step beyond simply describing univariate datasets; one step beyond!

Multivariate data

In this chapter, we are going to describe relationships, and begin working with multivariate data, which is a fancy way of saying samples containing more than one variable.

The troublemaker reader might remark that all the datasets we've worked with thus far (mtcars and airquality) have contained more than one variable. This is technically true, but only technically. The fact of the matter is that we've only been working with one of each dataset's variables at any one time. Note that multivariate analytics is not the same as doing univariate analytics on more than one variable; multivariate analyses and descriptions of relationships involve several variables at the same time.

To put this more concretely, in the last chapter we described the shape of, say, the temperature readings in the airquality dataset.

> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

In this chapter, we will be exploring whether there is a relationship between temperature and the month in which the temperature was taken (spoiler alert: there is!).
The kind of multivariate analysis you perform is heavily influenced by the type of data that you are working with. There are three broad classes of bivariate (or two-variable) relationships:

- The relationship between one categorical variable and one continuous variable
- The relationship between two categorical variables
- The relationship between two continuous variables

We will get into all of these in the next three sections. In the section after that, we will touch on describing the relationships between more than two variables. Finally, following in the tradition of the previous chapter, we will end with a section on how to create your own plots to capture the relationships that we'll be exploring.

Relationships between a categorical and a continuous variable

Describing the relationship between categorical and continuous variables is perhaps the most familiar of the three broad categories.

When I was in the fifth grade, my class had to participate in an area-wide science fair. We were to devise our own experiment, perform it, and then present it. For some reason, in my experiment I chose to water some lentil sprouts with tap water and some with alcohol to see if they grew differently.

When I measured the heights and compared the measurements of the teetotaller lentils versus the drunken lentils, I was pointing out a relationship between a categorical variable (alcohol/no-alcohol) and a continuous variable (the heights of the seedlings).

Note

Note that I wasn't trying to make a broader statement about how alcohol affects plant growth. In the grade-school experiment, I was just summarizing the differences in the heights of those plants, the ones that were in the experiment. In order to make statements or draw conclusions about how alcohol affects plant growth in general, we would be exiting the realm of exploratory data analysis and entering the domain of inferential statistics, which we will discuss in the next unit.

The alcohol could have made the lentils grow faster (it didn't), grow slower (it did), or grow at the same rate as the tap-water lentils. All three of these possibilities constitute a relationship: greater than, less than, or equal to.

To demonstrate how to uncover the relationship between these two types of variables in R, we will be using the iris dataset that is conveniently built right into R.
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

This is a famous dataset and is used today primarily for teaching purposes. It gives the lengths and widths of the petals and sepals (another part of the flower) of 150 iris flowers. Of the 150 flowers, it has 50 measurements each from three different species of iris: setosa, versicolor, and virginica.

By now, we know how to take the mean of all the petal lengths:

> mean(iris$Petal.Length)
[1] 3.758

But we could also take the mean of the petal lengths of each of the three species to see if there is any difference in the means.

Naively, one might approach this task in R as follows:

> mean(iris$Petal.Length[iris$Species=="setosa"])
[1] 1.462
> mean(iris$Petal.Length[iris$Species=="versicolor"])
[1] 4.26
> mean(iris$Petal.Length[iris$Species=="virginica"])
[1] 5.552

But, as you might imagine, there is a far easier way to do this:

> by(iris$Petal.Length, iris$Species, mean)
iris$Species: setosa
[1] 1.462
--------------------------------------------
iris$Species: versicolor
[1] 4.26
--------------------------------------------
iris$Species: virginica
[1] 5.552

by is a handy function that applies a function to each subset of the data. In this case, the Petal.Length vector is divided into three subsets, one for each species, and then the mean function is called on each of those subsets. It appears as if the setosas in this sample have way shorter petals than the other two species, with the virginica samples' petal length beating out versicolor's by a smaller margin.
Although means are probably the most common statistic to be compared between categories, they are not the only statistic we can use to compare. If we had reason to believe that the virginicas have a more widely varying petal length than the other two species, we could pass the sd function to the by function as follows:

> by(iris$Petal.Length, iris$Species, sd)

Most often, though, we want to be able to compare many statistics between groups at one time. To this end, it's very common to pass in the summary function:

> by(iris$Petal.Length, iris$Species, summary)
iris$Species: setosa
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.400   1.500   1.462   1.575   1.900 
------------------------------------------------
iris$Species: versicolor
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3.00    4.00    4.35    4.26    4.60    5.10 
------------------------------------------------
iris$Species: virginica
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.500   5.100   5.550   5.552   5.875   6.900 

As common as this idiom is, it still presents us with a lot of dense information that is difficult to make sense of at a glance. It is more common still to visualize the differences in continuous variables between categories using a box-and-whisker plot:

Figure 3.1: A box-and-whisker plot depicting the relationship between the petal lengths of the different iris species in the iris dataset

A box-and-whisker plot (or simply, a boxplot, if you have places to go and you're in a rush) displays a stunningly large amount of information in a single chart. Each level of the categorical variable gets its own box and whiskers. The bottom and top ends of the box represent the first and third quartiles respectively, and the black band inside the box is the median for that group, as shown in the following figure:

Figure 3.2: The anatomy of a boxplot

Depending on whom you talk to and what you use to produce your plots, the edges of the whiskers can mean a few different things. In my favorite variation (called Tukey's variation), the bottom of the whiskers extends to the lowest datum within 1.5 times the interquartile range below the bottom of the box. Similarly, the very top of the whisker represents the highest datum within 1.5 interquartile ranges above the third quartile (remember: the interquartile range is the third quartile minus the first). This is, coincidentally, the variation that ggplot2 uses.
The great thing about boxplots is that not only do we get a great sense of the central tendency and dispersion of the distribution within a category, but we can also immediately spot the important differences between each category.

From the boxplot in the previous image, it's easy to tell what we already know about the central tendency of the petal lengths between species: that the setosas in this sample have the shortest petals; that the virginicas have the longest on average; and that the versicolors are in the middle, but are closer to the virginicas.

In addition, we can see that the setosas have the thinnest dispersion, and that the virginicas have the highest, when you disregard the outlier.

But remember, we are not saying anything, or drawing any conclusions yet, about iris flowers in general. In all of these analyses, we are treating all the data we have as the population of interest; in this example, the 150 flowers measured are our population of interest.

Before we move on to the next broad category of relationships, let's look at the airquality dataset, treat the month as the categorical variable and the temperature as the continuous variable, and see whether the average temperature varies across the months.

> by(airquality$Temp, airquality$Month, mean)
airquality$Month: 5
[1] 65.54839
---------------------------------------------
airquality$Month: 6
[1] 79.1
---------------------------------------------
airquality$Month: 7
[1] 83.90323
---------------------------------------------
airquality$Month: 8
[1] 83.96774
---------------------------------------------
airquality$Month: 9
[1] 76.9

This is precisely what we would expect from a city in the Northern Hemisphere:

Figure 3.3: A boxplot of NYC temperatures across months (May to September)

Relationships between two categorical variables

Describing the relationships between two categorical variables is done somewhat less often than the other two broad types of bivariate analyses, but it is just as fun (and useful)!

To explore this technique, we will be using the dataset UCBAdmissions, which contains the data on graduate school applicants to the University of California, Berkeley, in 1973.
Before we get started, we have to wrap the dataset in a call to data.frame to coerce it into a data frame type variable; I'll explain why soon.

> ucba <- data.frame(UCBAdmissions)
> head(ucba)
     Admit Gender Dept Freq
1 Admitted   Male    A  512
2 Rejected   Male    A  313
3 Admitted Female    A   89
4 Rejected Female    A   19
5 Admitted   Male    B  353
6 Rejected   Male    B  207

Now, what we want is a count of the frequencies of the number of students in each of the following four categories:

- Accepted female
- Rejected female
- Accepted male
- Rejected male

Do you remember the frequency tabulation at the beginning of the last chapter? This is similar, except that now we are dividing the set by one more variable. This is known as cross-tabulation, or crosstab. It is also sometimes referred to as a contingency table. The reason we had to coerce UCBAdmissions into a data frame is that it was already in the form of a cross-tabulation (except that it further broke the data down into the different departments of the grad school). Check it out by typing UCBAdmissions at the prompt.

We can use the xtabs function in R to make our own cross-tabulations:

> # the first argument to xtabs (the formula) should
> # be read as: frequency *by* Gender and Admission
> cross <- xtabs(Freq ~ Gender + Admit, data=ucba)
> cross
        Admit
Gender   Admitted Rejected
  Male       1198     1493
  Female      557     1278

Here, at a glance, we can see that there were 1198 males who were admitted, 557 females who were admitted, and so on.

Is there a gender bias in UCB's graduate admissions process? Perhaps, but it's hard to tell from just looking at the 2x2 contingency table. Sure, there are fewer females accepted than males, but there are also, unfortunately, far fewer females who applied to UCB in the first place.

To aid us in either implicating UCB as a sexist admissions machine or exonerating it, it would help to look at a proportions table. Using a proportions table, we can easily compare the proportion of the total number of males who were accepted versus the proportion of the total number of females who were accepted. If the proportions are more or less equal, we can conclude that gender does not constitute a factor in UCB's admissions process. If this is the case, gender and admission status are said to be independent.
> prop.table(cross, 1)
        Admit
Gender    Admitted  Rejected
  Male   0.4451877 0.5548123
  Female 0.3035422 0.6964578

Note

Why did we supply 1 as an argument to prop.table? Look up the documentation at the R prompt. When would we want to use prop.table(cross, 2)?

Here, we can see that while 45 percent of the males who applied were accepted, only 30 percent of the females who applied were accepted. This is evidence that the admissions department is sexist, right? Not so fast, my friend!

This is precisely what a lawsuit lodged against UCB purported. When the issue was looked into further, it was discovered that, at the department level, women and men actually had similar admissions rates. In fact, some of the departments appeared to have a small but significant bias in favor of women. Check out department A's proportions table, for example:

> cross2 <- xtabs(Freq ~ Gender + Admit, data=ucba[ucba$Dept=="A",])
> prop.table(cross2, 1)
        Admit
Gender    Admitted  Rejected
  Male   0.6206061 0.3793939
  Female 0.8240741 0.1759259

If there were any bias in admissions, these data didn't prove it. This phenomenon, where a trend that appears in combined groups of data disappears or reverses when the data are broken down into groups, is known as Simpson's Paradox. In this case, it was caused by the fact that women tended to apply to departments that were far more selective.

This is probably the most famous case of Simpson's Paradox, and it is also why this dataset is built into R. The lesson here is to be careful when using pooled data, and to look out for hidden variables.

The relationship between two continuous variables

Do you think that there is a relationship between women's heights and their weights? If you said yes, congratulations, you're right!

We can verify this assertion by using the data in R's built-in dataset women, which holds the heights and weights of 15 American women from ages 30 to 39.

> head(women)
  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129
> nrow(women)
[1] 15

Specifically, this relationship is referred to as a positive relationship, because as one of the variables increases, we expect an increase in the other variable.

The most typical visual representation of the relationship between two continuous variables is a scatterplot.
A scatterplot is displayed as a group of points whose position along the x axis is established by one variable, and whose position along the y axis is established by the other. When there is a positive relationship, the dots, for the most part, start in the lower-left corner and extend to the upper-right corner, as shown in the following figure. When there is a negative relationship, the dots start in the upper-left corner and extend to the lower-right one. When there is no relationship, it will look as if the dots are all over the place.

Figure 3.4: Scatterplot of women's heights and weights

The more the dots look like they form a straight line, the stronger the relationship between the two continuous variables is said to be; the more diffuse the points, the weaker the relationship. The dots in the preceding figure look almost exactly like a straight line; this is pretty much as strong a relationship as they come.

These kinds of relationships are colloquially referred to as correlations.

Covariance

As always, visualizations are great, necessary even, but on most occasions we are going to quantify these correlations and summarize them with numbers.

The simplest measure of correlation that is widely used is the covariance. For each pair of values from the two variables, the differences from their respective means are taken. Then, those values are multiplied. If they are both positive (that is, both values are above their respective means), then the product will be positive too. If both values are below their respective means, the product is still positive, because the product of two negative numbers is positive. Only when one of the values is above its mean will the product be negative:

cov(x, y) = sum((x_i - x̄)(y_i - ȳ)) / (n - 1)

Remember, in sample statistics we divide by the degrees of freedom, and not the sample size. Note that this means that the covariance is only defined for two vectors that have the same length.

We can find the covariance between two variables in R using the cov function. Let's find the covariance between the heights and weights in the dataset, women:

> cov(women$weight, women$height)
[1] 69
> # the order we put the two columns in
> # the arguments doesn't matter
> cov(women$height, women$weight)
[1] 69

The covariance is positive, which denotes a positive relationship between the two variables.
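The verbal recipe above maps directly onto R code. A quick sketch, using the women dataset, confirms that the by-hand computation matches the built-in cov function:

```r
x <- women$height
y <- women$weight

# multiply each pair of deviations from the respective means,
# then divide by the degrees of freedom (n - 1)
manual.cov <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
manual.cov      # 69, identical to cov(x, y)
```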
The covariance, by itself, is difficult to interpret. It is especially difficult to interpret in this case, because the measurements use different scales: inches and pounds. It is also heavily dependent on the variability of each variable.

Consider what happens when we take the covariance of the weights in pounds and the heights in centimeters:

> # there are 2.54 centimeters in each inch
> # changing the units to centimeters increases
> # the variability within the height variable
> cov(women$height*2.54, women$weight)
[1] 175.26

Semantically speaking, the relationship hasn't changed, so why should the covariance?

Correlation coefficients

A solution to this quirk of covariance is to use Pearson's correlation coefficient instead. Outside its colloquial context, when the word correlation is uttered, especially by analysts, statisticians, or scientists, it usually refers to Pearson's correlation.

Pearson's correlation coefficient is different from covariance in that, instead of using the sum of the products of the deviations from the mean in the numerator, it uses the sum of the products of the number of standard deviations away from the mean. These numbers-of-standard-deviations-from-the-mean are called z-scores. If a value has a z-score of 1.5, it is 1.5 standard deviations above the mean; if a value has a z-score of -2, then it is 2 standard deviations below the mean.

Pearson's correlation coefficient is usually denoted by r, and its equation is given as follows:

r = cov(x, y) / (s_x * s_y)

which is the covariance divided by the product of the two variables' standard deviations.

An important consequence of using standardized z-scores instead of the magnitude of distance from the mean is that changing the variability in one variable does not change the correlation coefficient. Now you can meaningfully compare values using two different scales, or even two different distributions. The correlation between weight/height-in-inches and weight/height-in-centimeters will now be identical, because multiplication by 2.54 does not change the z-scores of each height.
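This definition is easy to verify by hand: convert both variables to z-scores, sum the products over the degrees of freedom, and compare with R's cor function:

```r
x <- women$height
y <- women$weight

# z-scores: how many standard deviations each value lies from its mean
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

r <- sum(zx * zy) / (length(x) - 1)
r                            # matches cor(x, y)
cov(x, y) / (sd(x) * sd(y))  # the covariance form gives the same number
```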
> cor(women$height, women$weight)
[1] 0.9954948
> cor(women$height*2.54, women$weight)
[1] 0.9954948

Another important and helpful consequence of this standardization is that the measure of correlation will always range from -1 to 1. A Pearson correlation coefficient of 1 will denote a perfectly positive (linear) relationship, an r of -1 will denote a perfectly negative (linear) relationship, and an r of 0 will denote no (linear) relationship.

Why the linear qualification in parentheses, though?

Intuitively, the correlation coefficient shows how well two variables are described by the straight line that fits the data most closely; this is called a regression or trend line. If there is a strong relationship between two variables, but the relationship is not linear, it cannot be represented accurately by Pearson's r. For example, the correlation between 1 to 100 and 101 to 200 is 1 (because it is perfectly linear), but a cubic relationship is not:

> xs <- 1:100
> cor(xs, xs+100)
[1] 1
> cor(xs, xs^3)
[1] 0.917552

It is still about 0.92, which is an extremely strong correlation, but not the 1 that you should expect from a perfect correlation.

So Pearson's r assumes a linear relationship between two variables. There are, however, other correlation coefficients that are more tolerant of non-linear relationships. Probably the most common of these is Spearman's rank coefficient, also called Spearman's rho. Spearman's rho is calculated by taking the Pearson correlation not of the values, but of their ranks.

Note

What's a rank?

When you assign ranks to a vector of numbers, the lowest number gets 1, the second lowest gets 2, and so on. The highest datum in the vector gets a rank that is equal to the number of elements in that vector.

In rankings, the magnitude of the difference in values of the elements is disregarded. Consider a race to a finish line involving three cars. Let's say that the winner in first place finished at a speed three times that of the car in second place, and the car in second place beat the car in third place by only a few seconds. The driver of the car that came in first has a good reason to be proud of herself, but her rank, 1st place, does not say anything about how effectively she cleaned the floor with the other two candidates.
Try using R's rank function on the vector c(8,6,7,5,3,0,9). Now try it on the vector c(8,6,7,5,3,-100,99999). The rankings are the same, right?

When we use ranks instead, the pair that has the lowest value on both the x and the y axes will be c(1, 1), even if one variable is a non-linear function (cubed, squared, logarithmic, and so on) of the other. The correlations that we just tested will both have Spearman rhos of 1, because cubing a value will not change its rank.

 > xs <- 1:100
 > cor(xs, xs+100, method="spearman")
 [1] 1
 > cor(xs, xs^3, method="spearman")
 [1] 1

Figure 3.5: Scatterplot of y = x + 100 with regression line. r and rho are both 1

Figure 3.6: Scatterplot of y = x^3 with regression line. r is .92, but rho is 1

Let's use what we've learned so far to investigate the correlation between the weight of a car and the number of miles it gets to the gallon. Do you predict a negative relationship (the heavier the car, the lower the miles per gallon)?

 > cor(mtcars$wt, mtcars$mpg)
 [1] -0.8676594

Figure 3.7: Scatterplot of the relationship between the weight of a car and its miles per gallon

That is a strong negative relationship. Although, in the preceding figure, note that the data points are more diffuse and spread around the regression line than in the other plots; this indicates a somewhat weaker relationship than we have seen thus far.

For an even weaker relationship, check out the correlation between wind speed and temperature in the airquality dataset as depicted in the following image:

 > cor(airquality$Temp, airquality$Wind)
 [1] -0.4579879
 > cor(airquality$Temp, airquality$Wind, method="spearman")
 [1] -0.4465408

Figure 3.8: Scatterplot of the relationship between wind speed and temperature

Comparing multiple correlations

Armed with our new standardized coefficients, we can now effectively compare the correlations between different pairs of variables directly.
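Since Spearman's rho is just Pearson's r applied to the ranks, you can verify the definition yourself in base R; here is a small sketch using the cubic example from above:

```r
xs <- 1:100

# Spearman's rho via the built-in method argument...
rho.builtin <- cor(xs, xs^3, method = "spearman")

# ...and "by hand": Pearson's r of the ranks
rho.by.hand <- cor(rank(xs), rank(xs^3))

print(c(rho.builtin, rho.by.hand))  # both are 1

# ranks ignore magnitude: these two vectors rank identically
print(rank(c(8, 6, 7, 5, 3, 0, 9)))
print(rank(c(8, 6, 7, 5, 3, -100, 99999)))
```

Cubing is a monotonic transformation, so the ranks (and therefore rho) are unchanged.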
In data analysis, it is common to compare the correlations between all the numeric variables in a single dataset. We can do this with the iris dataset using the following R code snippet:

 > # have to drop 5th column (species is not numeric)
 > iris.nospecies <- iris[, -5]
 > cor(iris.nospecies)
              Sepal.Length Sepal.Width Petal.Length Petal.Width
 Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
 Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
 Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
 Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

This produces a correlation matrix (when it is done with the covariance, it is called a covariance matrix). It is square (the same number of rows and columns) and symmetric, which means that the matrix is identical to its transposition (the matrix with the axes flipped). It is symmetric because there are two elements for each pair of variables on either side of the diagonal line of 1s. The diagonal line is all 1s, because every variable is perfectly correlated with itself. Which are the most highly (positively) correlated pairs of variables? What about the most negatively correlated?

Visualization methods

We are now going to see how we can create these kinds of visualizations on our own.

Categorical and continuous variables

We have seen that box plots are a great way of comparing the distribution of a continuous variable across different categories. As you might expect, box plots are very easy to produce using ggplot2. The following snippet produces the box-and-whisker plot that we saw earlier, depicting the relationship between the petal lengths of the different iris species in the iris dataset:

 > library(ggplot2)
 > qplot(Species, Petal.Length, data=iris, geom="boxplot",
 +       fill=Species)

First, we specify the variable on the x-axis (the iris species) and then the continuous variable on the y-axis (the petal length). Finally, we specify that we are using the iris dataset, that we want a box plot, and that we want to fill the boxes with different colors for each iris species.
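Rather than eyeballing the matrix, you can answer those two questions programmatically. This sketch masks the diagonal and the redundant lower triangle, then looks up the strongest positive and the most negative coefficients:

```r
iris.nospecies <- iris[, -5]
cor.matrix <- cor(iris.nospecies)

# blank out the diagonal and the redundant lower triangle
cor.matrix[lower.tri(cor.matrix, diag = TRUE)] <- NA

# locations of the strongest positive and most negative correlations
highest <- which(cor.matrix == max(cor.matrix, na.rm = TRUE), arr.ind = TRUE)
lowest  <- which(cor.matrix == min(cor.matrix, na.rm = TRUE), arr.ind = TRUE)

print(c(rownames(cor.matrix)[highest[1]], colnames(cor.matrix)[highest[2]]))
print(c(rownames(cor.matrix)[lowest[1]],  colnames(cor.matrix)[lowest[2]]))
```

This reports Petal.Length/Petal.Width as the most positively correlated pair and Sepal.Width/Petal.Length as the most negatively correlated, matching the matrix above.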
Another fun way of comparing distributions between the different categories is by using an overlapping density plot:

 > qplot(Petal.Length, data=iris, geom="density", alpha=I(.7),
 +       fill=Species)

Here we need only specify the continuous variable, since the fill parameter will break down the density plot by species. The alpha parameter adds transparency to show more clearly the extent to which the distributions overlap.

Figure 3.9: Overlapping density plot of petal length of iris flowers across species

If it is not the distribution you are trying to compare but some kind of single-value statistic (like standard deviation or sample counts), you can use the by function to get that value across all categories, and then build a bar plot where each category is a bar, and the heights of the bars represent that category's statistic. For the code to construct a bar plot, refer back to the last section in Chapter 1, RefresheR.

Two categorical variables

The visualization of categorical data is a grossly understudied domain and, in spite of some fairly powerful and compelling visualization methods, these techniques remain relatively unpopular.

My favorite method for graphically illustrating contingency tables is to use a mosaic plot. To make mosaic plots, we will need to install and load the vcd (Visualizing Categorical Data) package:

 > # install.packages("vcd")
 > library(vcd)
 >
 > ucba <- data.frame(UCBAdmissions)
 > mosaic(Freq ~ Gender + Admit, data=ucba,
 +        shade=TRUE, legend=FALSE)

Figure 3.10: A mosaic plot of the UCBAdmissions dataset (across all departments)

The first argument to the mosaic function is a formula. This formula is meant to be read as: display frequency broken down by gender and whether the applicant was admitted. shade=TRUE adds a little life to the plot by adding colors to the boxes. The colors are actually very meaningful, as is the legend we opted not to show with the final parameter—but its meaning is beyond the scope of this section.

The mosaic plot represents each cell of a 2x2 contingency table as a tile; the area of the box is proportional to the number of observations in that cell. From this plot, we can easily tell that (a) more men applied to UCB than women, (b) more applicants were rejected than accepted, and (c) women were rejected at a higher proportion than male applicants.
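As a sketch of the by-then-barplot approach just described, here the single-value statistic is (by assumption, for illustration) the standard deviation of petal length per species:

```r
# per-category statistic: standard deviation of petal length by species
sds <- by(iris$Petal.Length, iris$Species, sd)
print(sds)

# one bar per species; the bar height is that species' statistic
barplot(as.numeric(sds), names.arg = levels(iris$Species),
        ylab = "SD of petal length")
```

Any other single-value statistic (mean, count via length, and so on) can be swapped in as the third argument to by.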
You remember how this was misleading, right? Let's look at the mosaic plot for only department A:

 > mosaic(Freq ~ Gender + Admit, data=ucba[ucba$Dept=="A",],
 +        shade=TRUE, legend=FALSE)

Figure 3.11: A mosaic plot of the UCBAdmissions dataset for department A

Hopefully, this plot makes the treachery of Simpson's paradox more apparent. Notice how there were far fewer female applicants than males, but the admission rates for the female applicants were much higher. Try visualizing the mosaic plots for the other departments by yourself!

Two continuous variables

The canonical way of displaying relationships between two continuous variables is via scatterplots. The scatterplot for the women's heights and weights that we saw earlier in this chapter was produced with the following R code snippet:

 > qplot(height, weight, data=women, geom="point")

Whether you put height or weight first depends on which variable you want tied to the x-axis.

What about that fancy regression line?!, you ask frantically. ggplot2 gracefully provides this feature with just a few extra characters. The scatterplot of the relationship between the weight of a car and its miles per gallon was produced as follows:

 > qplot(wt, mpg, data=mtcars, geom=c("point", "smooth"),
 +       method="lm", se=FALSE)

Here, we are specifying that we want two kinds of geometric objects, point and smooth. The latter is responsible for the regression line. method="lm" tells qplot that we want to use a linear model to create the trend line.

If we leave out the method, ggplot2 will choose a method automatically; in this case, it would default to a method of drawing a non-linear trend line called LOESS:

 > qplot(wt, mpg, data=mtcars, geom=c("point", "smooth"), se=FALSE)

Figure 3.12: A scatterplot of the relationship between the weight of a car and its miles per gallon, and a trend line smoothed with LOESS

The se=FALSE directive instructs ggplot2 not to plot the estimates of the error. We will get to what this means in a later chapter.
More than two continuous variables

Finally, there is an excellent way to visualize correlation matrices like the one we saw with the iris dataset in the section Comparing multiple correlations. To do this, we have to install and load the corrgram package as follows:

 > # install.packages("corrgram")
 > library(corrgram)
 >
 > corrgram(iris, lower.panel=panel.conf, upper.panel=panel.pts)

Figure 3.13: A corrgram of the iris dataset's continuous variables

With corrgrams, we can exploit the fact that correlation matrices are symmetric by packing in more information. On the lower left panel, we have the Pearson correlation coefficients (never mind the small ranges beneath each coefficient for now). Instead of repeating these coefficients for the upper right panel, we can show a small scatterplot there instead.

We aren't limited to showing the coefficients and scatterplots in our corrgram, though; there are many other options and configurations available:

 > corrgram(iris, lower.panel=panel.pie, upper.panel=panel.pts,
 +          diag.panel=panel.density,
 +          main=paste0("corrgram of petal and sepal",
 +                      " measurements in iris dataset"))

Figure 3.14: Another corrgram of the iris dataset's continuous variables

Notice that this time, we can overlay a density plot wherever there is a variable name (on the diagonal)—just to get a sense of the variables' shapes. More saliently, instead of text coefficients, we have pie charts in the lower-left panel. These pie charts are meant to graphically depict the strength of the correlations.

If the color of the pie is blue (or any shade thereof), the correlation is positive; the bigger the shaded area of the pie, the stronger the magnitude of the correlation. If, however, the color of the pie is red or a shade of red, the correlation is negative, and the amount of shading on the pie is proportional to the magnitude of the correlation.

To top it all off, we added the main parameter to set the title of the plot. Note the use of paste0 so that I could split the title up into two lines of code.
To get a better sense of what corrgram is capable of, you can view a live demonstration of examples if you execute the following at the prompt:

 > example(corrgram)

Exercises

Try out the following exercises to revise the concepts learned so far:

Look at the documentation on cor with help("cor"). You can see, in addition to "pearson" and "spearman", there is an option for "kendall". Learn about Kendall's tau. Why, and under what conditions, is it considered better than Spearman's rho?

For each species of iris, find the correlation coefficient between the sepal length and width. Are there any differences? How did we just combine two different types of the broad categories of bivariate analyses to perform a complex multivariate analysis?

Download a dataset from the web, or find another built-into-R dataset that suits your fancy (using library(help="datasets")). Explore relationships between the variables that you think might have some connection.

Gustave Flaubert is well understood to be a classist misogynist and this, of course, influenced how he developed the character of Emma Bovary. However, it is not uncommon for readers to identify and empathize with her, and they are often devastated by the book's conclusion. In fact, translator Geoffrey Wall asserts that Emma dies in a pain that is exactly adjusted to the intensity of our preceding identification. How can the fact that some sympathize with Emma be reconciled with Flaubert's apparent intention? In your response, assume a post-structuralist approach to authorial intent.

Summary

There were many new ideas introduced in this chapter, so kudos to you for making it through! You're well on the way to being able to tackle some extraordinarily interesting problems on your own!

To summarize, in this chapter, we learned that the relationships between two variables can be broken down into three broad categories.

For categorical/continuous variables, we learned how to use the by function to retrieve the statistics on the continuous variable for each category. We also saw how we can use box-and-whisker plots to visually inspect the distributions of the continuous variable across categories.
For categorical/categorical configurations, we used contingency and proportions tables to compare frequencies. We also saw how mosaic plots can help spot interesting aspects of the data that might be difficult to detect when just looking at the raw numbers.

For continuous/continuous data, we discovered the concepts of covariance and correlation and explored different correlation coefficients with different assumptions about the nature of the bivariate relationship. We also learned how these concepts could be expanded to describe the relationship between more than two continuous variables. Finally, we learned how to use scatterplots and corrgrams to visually depict these relationships.

With this chapter, we've concluded the unit on exploratory data analysis, and we'll be moving on to confirmatory data analysis and inferential statistics.

Chapter 4. Probability

It's time for us to put descriptive statistics down for the time being. It was fun for a while, but we're no longer content just determining the properties of observed data; now we want to start making deductions about data we haven't observed. This leads us to the realm of inferential statistics.

In data analysis, probability is used to quantify uncertainty of our deductions about unobserved data. In the land of inferential statistics, probability reigns queen. Many regard her as a harsh mistress, but that's just a rumor.

Basic probability

Probability measures the likeliness that a particular event will occur. When mathematicians (us, for now!) speak of an event, we are referring to a set of potential outcomes of an experiment, or trial, to which we can assign a probability of occurrence.

Probabilities are expressed as a number between 0 and 1 (or as a percentage out of 100). An event with a probability of 0 denotes an impossible outcome, and a probability of 1 describes an event that is certain to occur.

The canonical example of probability at work is a coin flip. In the coin flip event, there are two outcomes: the coin lands on heads, or the coin lands on tails. Pretending that coins never land on their edge (they almost never do), those two outcomes are the only ones possible. The sample space (the set of all possible outcomes), therefore, is {heads, tails}.
Since the entire sample space is covered by these two outcomes, they are said to be collectively exhaustive.

The sum of the probabilities of collectively exhaustive events is always 1. In this example, the probability that the coin flip will yield heads or yield tails is 1; it is certain that the coin will land on one of those. In a fair and correctly balanced coin, each of those two outcomes is equally likely. Therefore, we split the probability equally among the outcomes: in the event of a coin flip, the probability of obtaining heads is 0.5, and the probability of tails is 0.5 as well. This is usually denoted as follows:

 P(heads) = 0.5
 P(tails) = 0.5

The probability of a coin flip yielding either heads or tails looks like this:

 P(heads ∪ tails) = 1

And the probability of a coin flip yielding both heads and tails is denoted as follows:

 P(heads ∩ tails) = 0

The two outcomes, in addition to being collectively exhaustive, are also mutually exclusive. This means that they can never co-occur. This is why the probability of heads and tails is 0; it just can't happen.

The next obligatory application of beginner probability theory is in the case of rolling a standard six-sided die. In the event of a die roll, the sample space is {1, 2, 3, 4, 5, 6}. With every roll of the die, we are sampling from this space. In this event, too, each outcome is equally likely, except now we have to divide the probability across six outcomes. In the following equation, we denote the probability of rolling a 1 as P(1):

 P(1) = 1/6 ≈ 0.167

Rolling a 1 or rolling a 2 is not collectively exhaustive (we can still roll a 3, 4, 5, or 6), but they are mutually exclusive; we can't roll a 1 and a 2. If we want to calculate the probability of either one of two mutually exclusive events occurring, we add the probabilities:

 P(1 ∪ 2) = P(1) + P(2) = 1/6 + 1/6 = 1/3

While rolling a 1 or rolling a 2 isn't collectively exhaustive, rolling a 1 and not rolling a 1 are. This is usually denoted in this manner:

 P(1 ∪ ¬1) = P(1) + P(¬1) = 1

These two events—and all events that are both collectively exhaustive and mutually exclusive—are called complementary events.
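These rules are easy to check empirically with a quick simulation in R (a sketch; the exact proportions will vary slightly with the random seed):

```r
set.seed(42)
rolls <- sample(1:6, 100000, replace = TRUE)  # 100,000 fair die rolls

# P(1) should be close to 1/6
print(mean(rolls == 1))

# mutually exclusive events: P(1 or 2) = P(1) + P(2) = 1/3
print(mean(rolls == 1 | rolls == 2))

# complementary events: P(1) + P(not 1) is exactly 1
print(mean(rolls == 1) + mean(rolls != 1))
```

The first two proportions hover near 1/6 and 1/3; the last sum is exactly 1 by construction.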
Our last pedagogical example in basic probability theory is using a deck of cards. Our deck has 52 cards—4 for each number from 2 to 10 and 4 each of Jack, Queen, King, and Ace (no Jokers!). Each of these 4 cards belongs to one suit, either a Heart, Club, Spade, or Diamond. There are, therefore, 13 cards in each suit. Further, every Heart and Diamond card is colored red, and every Spade and Club is black. From this, we can deduce the following probabilities for the outcome of randomly choosing a card:

 P(red) = 26/52 = 0.5
 P(Heart) = 13/52 = 0.25
 P(Ace) = 4/52 ≈ 0.077

What, then, is the probability of getting a black card and an Ace? Well, these events are conditionally independent, meaning that the probability of either outcome does not affect the probability of the other. In cases like these, the probability of event A and event B is the product of the probability of A and the probability of B. Therefore:

 P(black ∩ Ace) = P(black) * P(Ace) = 1/2 * 1/13 = 1/26 ≈ 0.038

Intuitively, this makes sense, because there are two black Aces out of a possible 52.

What about the probability that we choose a red card and a Heart? These two outcomes are not conditionally independent, because knowing that the card is red has a bearing on the likelihood that the card is also a Heart. In cases like these, the probability of event A and B is denoted as follows:

 P(A ∩ B) = P(B) * P(A|B)

where P(A|B) means the probability of A given B. For example, if we represent A as drawing a Heart and B as drawing a red card, P(A|B) means what's the probability of drawing a Heart if we know that the card we drew was red?. Since a red card is equally likely to be a Heart or a Diamond, P(A|B) is 0.5. Therefore:

 P(Heart ∩ red) = P(red) * P(Heart|red) = 1/2 * 1/2 = 1/4

In the preceding equation, we used the form P(B)P(A|B). Had we used the form P(A)P(B|A), we would have got the same answer:

 P(Heart ∩ red) = P(Heart) * P(red|Heart) = 1/4 * 1 = 1/4

So, these two forms are equivalent:

 P(A ∩ B) = P(B) * P(A|B) = P(A) * P(B|A)

For kicks, let's divide both sides of the equation by P(B). That yields the following equivalence:

 P(A|B) = P(B|A) * P(A) / P(B)

This equation is known as Bayes' Theorem. This equation is very easy to derive, but its meaning and influence is profound. In fact, it is one of the most famous equations in all of mathematics.

Bayes' Theorem has been applied to and proven useful in an enormous number of different disciplines and contexts. It was used to help crack the German Enigma code during World War II, saving the lives of millions. It was also used recently, and famously, by Nate Silver to help correctly predict the voting patterns of 49 states in the 2008 US presidential election.
At its core, Bayes' Theorem tells us how to update the probability of a hypothesis in light of new evidence. Due to this, the following formulation of Bayes' Theorem is often more intuitive:

 P(H|E) = P(E|H) * P(H) / P(E)

where H is the hypothesis and E is the evidence.

Let's see an example of Bayes' Theorem in action!

There's a hot new recreational drug on the scene called Allighate (or Ally for short). It's named as such because it makes its users go wild and act like an alligator. Since the effect of the drug is so deleterious, very few people actually take the drug. In fact, only about 1 in every thousand people (0.1%) take it.

Frightened by fear-mongering late-night news, Daisy Girl, Inc., a technology consulting firm, ordered an Allighate testing kit for all of its 200 employees so that it could offer treatment to any employee who has been using it. Not sparing any expense, they bought the best kit on the market; it had 99% sensitivity and 99% specificity. This means that it correctly identified drug users 99 out of 100 times, and only falsely identified a non-user as a user once in every 100 times.

When the results finally came back, two employees tested positive. Though the two denied using the drug, their supervisor, Ronald, was ready to send them off to get help. Just as Ronald was about to send them off, Shanice, a clever employee from the statistics department, came to their defense.

Ronald incorrectly assumed that each of the employees who tested positive was using the drug with 99% certainty and, therefore, that the chance that both were using it was 98%. Shanice explained that it was actually far more likely that neither employee was using Allighate.

How so? Let's find out by applying Bayes' theorem!

Let's focus on just one employee right now; let H be the hypothesis that one of the employees is using Ally, and E represent the evidence that the employee tested positive.
We want to solve the left side of the equation, so let's plug in values:

 P(Ally user | positive test) = P(positive test | Ally user) * P(Ally user) / P(positive test)

The first part of the right side of the equation, P(positive test | Ally user), is called the likelihood. The probability of testing positive if you use the drug is 99%; this is what tripped up Ronald—and most other people when they first heard of the problem. The second part, P(Ally user), is called the prior. This is our belief that any one person has used the drug before we receive any evidence. Since we know that only .1% of people use Ally, this would be a reasonable choice for a prior. The denominator of the equation is a normalizing constant, which ensures that the probabilities over all possible hypotheses add up to one. Finally, the value we are trying to solve for, P(Ally user | positive test), is the posterior. It is the probability of our hypothesis updated to reflect new evidence.

In many practical settings, computing the normalizing factor is very difficult. In this case, because there are only two possible hypotheses, being a user or not, the probability of finding the evidence of a positive test is given as follows:

 P(positive test) = P(positive test | user) * P(user) + P(positive test | non-user) * P(non-user)

which is: (.99 * .001) + (.01 * .999) = 0.01098

Plugging that into the denominator, our final answer is calculated as follows:

 P(Ally user | positive test) = (.99 * .001) / 0.01098 ≈ 0.09

Note that the new evidence, which favored the hypothesis that the employee was using Ally, shifted our prior belief from .001 to .09. Even so, our prior belief about whether an employee was using Ally was so extraordinarily low, it would take some very, very strong evidence indeed to convince us that an employee was an Ally user.

Ignoring the prior probability in cases like these is known as the base-rate fallacy. Shanice assuaged Ronald's embarrassment by assuring him that it was a very common mistake.

Now to extend this to two employees: the probability of any two employees both using the drug is, as we now know, .001 squared, or one in a million. Squaring our new posterior of .09, we get .0081. The probability that both employees use Ally, even given their positive results, is less than 1%. So, they are exonerated.
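Shanice's arithmetic is easy to reproduce in R. This sketch simply restates the numbers from the story (a 0.1% prior, 99% sensitivity, and 99% specificity):

```r
prior <- 0.001          # P(Ally user)
sensitivity <- 0.99     # P(positive test | user)
false.positive <- 0.01  # P(positive test | non-user), from 99% specificity

# normalizing constant: total probability of a positive test
p.positive <- sensitivity * prior + false.positive * (1 - prior)

# Bayes' Theorem
posterior <- sensitivity * prior / p.positive
print(posterior)    # about 0.09

# probability that BOTH employees are users, given their positive tests
print(posterior^2)  # about 0.0081
```

Changing prior at the top is an easy way to explore how sensitive the posterior is to the base rate.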
Sally is a different story, though. Her friends noticed her behavior had dramatically changed as of late—she snaps at co-workers and has taken to eating pencils. Her concerned cubicle-mate even followed her after work and saw her crawl into a sewer, not to emerge until the next day to go back to work.

Even though Sally passed the drug test, we know that it's likely (almost certain) that she uses Ally. Bayes' theorem gives us a way to quantify that probability! Our prior is the same, but now our likelihood is pretty much as close to 1 as you can get—after all, how many non-Ally users do you think eat pencils and live in sewers?

A tale of two interpretations

Though it may seem strange to hear, there is actually a hot philosophical debate about what probability really is. Though there are others, the two primary camps into which virtually all mathematicians fall are the frequentist camp and the Bayesian camp.

The frequentist interpretation describes probability as the relative likelihood of observing an outcome in an experiment when you repeat the experiment multiple times. Flipping a coin is a perfect example; the probability of heads converges to 50% as the number of times it is flipped goes to infinity.

The frequentist interpretation of probability is inherently objective; there is a true probability out there in the world, which we are trying to estimate.

The Bayesian interpretation, however, views probability as our degree of belief about something. Because of this, the Bayesian interpretation is subjective; when evidence is scarce, there are sometimes wildly different degrees of belief among different people.

Described in this manner, Bayesianism may scare many people off, but it is actually quite intuitive. For example, when a meteorologist describes the probability of rain as 70%, people rarely bat an eyelash. But this number only really makes sense within a Bayesian framework, because exact meteorological conditions are not repeatable, as is required by frequentist probability.

Not simply a heady academic exercise, these two interpretations lead to different methodologies in solving problems in data analysis. Many times, both approaches lead to similar results. We will see examples of using both approaches to solve a problem later in this book.
Though practitioners may strongly align themselves with one side over another, good statisticians know that there's a time and a place for both approaches.

Note

Though Bayesianism as a valid way of looking at probability is debated, Bayes' theorem is a fact about probability and is undisputed and non-controversial.

Sampling from distributions

Observing the outcome of trials that involve a random variable, a variable whose value changes due to chance, can be thought of as sampling from a probability distribution—one that describes the likelihood of each member of the sample space occurring.

That sentence probably sounds much scarier than it needs to be. Take a die roll for example.

Figure 4.1: Probability distribution of outcomes of a die roll

Each roll of a die is like sampling from a discrete probability distribution for which each outcome in the sample space has a probability of 0.167 or 1/6. This is an example of a uniform distribution, because all the outcomes are uniformly as likely to occur. Further, there are a finite number of outcomes, so this is a discrete uniform distribution (there also exist continuous uniform distributions).

Flipping a coin is like sampling from a uniform distribution with only two outcomes. More specifically, the probability distribution that describes coin-flip events is called a Bernoulli distribution—it's a distribution describing only two events.

Parameters

We use probability distributions to describe the behavior of random variables because they make it easy to compute with and give us a lot of information about how a variable behaves. But before we perform computations with probability distributions, we have to specify the parameters of those distributions. These parameters will determine exactly what the distribution looks like and how it will behave.
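In R, drawing from this discrete uniform distribution is exactly what the sample function does; a quick sketch shows the empirical proportions settling near 1/6:

```r
set.seed(1)
rolls <- sample(1:6, 60000, replace = TRUE)

# empirical probability of each outcome; all should hover around 1/6
print(prop.table(table(rolls)))

# the two-outcome case: sampling coin flips from a Bernoulli-style trial
flips <- sample(c("heads", "tails"), 10, replace = TRUE)
print(flips)
```

The replace = TRUE argument is what makes each draw an independent trial from the same distribution.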
For example, the behavior of both a 6-sided die and a 12-sided die is modeled with a uniform distribution. Even though the behavior of both dice is modeled as a uniform distribution, the behavior of each is a little different. To further specify the behavior of each distribution, we detail its parameter; in the case of the (discrete) uniform distribution, the parameter is called n. A uniform distribution with parameter n has n equally likely outcomes of probability 1/n. The n for a 6-sided die and a 12-sided die is 6 and 12, respectively.

For a Bernoulli distribution, which describes the probability distribution of an event with only two outcomes, the parameter is p. Outcome 1 occurs with probability p, and the other outcome occurs with probability 1 - p, because they are collectively exhaustive. The flip of a fair coin is modeled as a Bernoulli distribution with p = 0.5.

Imagine a six-sided die with one side labeled 1 and the other five sides labeled 2. The outcome of the die roll trials can be described with a Bernoulli distribution, too! This time, p = 1/6 (about 0.17). Therefore, the probability of not rolling a 1 is 5/6.

The binomial distribution

The binomial distribution is a fun one. Like our uniform distribution described in the previous section, it is discrete.

When an event has two possible outcomes, success or failure, this distribution describes the number of successes in a certain number of trials. Its parameters are n, the number of trials, and p, the probability of success.

Concretely, a binomial distribution with n=1 and p=0.5 describes the behavior of a single coin flip—if we choose to view heads as successes (we could also choose to view tails as successes). A binomial distribution with n=30 and p=0.5 describes the number of heads we should expect when we flip a coin 30 times.

Figure 4.2: A binomial distribution (n=30, p=0.5)

On average, of course, we would expect to have 15 heads. However, randomness is the name of the game, and seeing more or fewer heads is totally expected.

How can we use the binomial distribution in practice?, you ask. Well, let's look at an application.

Larry the Untrustworthy Knave—who can only be trusted some of the time—gives us a coin that he alleges is fair. We flip it 30 times and observe 10 heads.
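R can sample directly from a binomial distribution with rbinom, whose size and prob arguments are the n and p parameters just described:

```r
set.seed(1)

# one experiment: flip a fair coin 30 times and count the heads
print(rbinom(1, size = 30, prob = 0.5))

# repeat the 30-flip experiment 100,000 times;
# the average number of heads should be close to n*p = 15
experiments <- rbinom(100000, size = 30, prob = 0.5)
print(mean(experiments))

# size=1 gives Bernoulli trials (single coin flips coded as 0/1)
print(rbinom(10, size = 1, prob = 0.5))
```

Note that the first argument to rbinom is how many experiments to run, not the number of flips per experiment; that is what size controls.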
It turns out that the probability of getting exactly 10 heads on 30 flips is about 2.8%*. We can use R to tell us the probability of getting 10 or fewer heads using the pbinom function:

 > pbinom(10, size=30, prob=.5)
 [1] 0.04936857

It appears as if the probability of this occurring, in a correctly balanced coin, is roughly 5%. Do you think we should take Larry at his word?

Note

*If you're interested

The way we determined the probability of getting exactly 10 heads is by using the probability formula for Bernoulli trials. The probability of getting k successes in n trials is equal to:

 P(k) = (n choose k) * p^k * (1-p)^(n-k)

where p is the probability of getting one success and:

 (n choose k) = n! / (k! * (n-k)!)

If your palms are getting sweaty, don't worry. You don't have to memorize this in order to understand any later concepts in this book.

The normal distribution

Do you remember in Chapter 2, The Shape of Data, when we described the normal distribution and how ubiquitous it is? The behavior of many random variables in real life is very well described by a normal distribution with certain parameters.

The two parameters that uniquely specify a normal distribution are µ (mu) and σ (sigma). µ, the mean, describes where the distribution's peak is located and σ, the standard deviation, describes how wide or narrow the distribution is.

Figure 4.3: Normal distributions with different parameters

The distribution of heights of American females is approximately normally distributed with parameters µ = 65 inches and σ = 3.5 inches.

Figure 4.4: Normal distributions with different parameters

With this information, we can easily answer questions about how probable it is to choose, at random, US women of certain heights.

As mentioned earlier in Chapter 2, The Shape of Data, we can't really answer the question What is the probability that we choose a person who is exactly 60 inches?, because virtually no one is exactly 60 inches. Instead, we answer questions about how probable it is that a random person is within a certain range of heights.
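The 2.8% figure in the note comes from dbinom, R's implementation of exactly this formula; a short sketch checks it against the formula written out with choose:

```r
# probability of exactly 10 heads in 30 flips of a fair coin
print(dbinom(10, size = 30, prob = 0.5))   # about 0.028

# the same value from the formula: (n choose k) * p^k * (1-p)^(n-k)
print(choose(30, 10) * 0.5^10 * 0.5^20)

# pbinom is just the sum of dbinom from 0 up to 10
print(sum(dbinom(0:10, size = 30, prob = 0.5)))
```

The last line reproduces the 0.04936857 from the pbinom call above, making the density/cumulative relationship explicit.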
What is the probability that a randomly chosen woman is 70 inches or taller? If you recall, the probability of a height within a range is the area under the curve, or the integral over that range. In this case, the range we will integrate looks like this:

Figure 4.5: Area under the curve of the height distribution from 70 inches to positive infinity

 > f <- function(x){ dnorm(x, mean=65, sd=3.5) }
 > integrate(f, 70, Inf)
 0.07656373 with absolute error < 2.2e-06

The preceding R code indicates that there is a 7.66% chance of randomly choosing a woman who is 70 inches or taller.

Luckily for us, the normal distribution is so popular and well studied that there is a function built into R, so we don't need to use integration ourselves:

 > pnorm(70, mean=65, sd=3.5)
 [1] 0.9234363

The pnorm function tells us the probability of choosing a woman who is shorter than 70 inches. If we want to find P(> 70 inches), we can either subtract this value from 1 (which gives us the complement) or use the optional argument lower.tail=FALSE. If you do this, you'll see that the result matches the 7.66% chance we arrived at earlier.

The three-sigma rule and using z-tables

When dealing with a normal distribution, we know that it is more likely to observe an outcome that is close to the mean than it is to observe one that is distant—but just how much more likely? Well, it turns out that roughly 68% of all the values drawn from a normal distribution lie within 1 standard deviation, or 1 z-score, away from the mean. Expanding our boundaries, we find that roughly 95% of all values are within 2 z-scores from the mean. Finally, about 99.7% of normal deviates are within 3 standard deviations from the mean. This is called the three-sigma rule.

Figure 4.6: The three-sigma rule

Before computers came on the scene, finding the probability of ranges associated with random deviates was a little more complicated. To save mathematicians from having to integrate the Gaussian (normal) function by hand (eww!), they used a z-table, or standard normal table. Though using this method today is, strictly speaking, unnecessary, and it is a little more involved, understanding how it works is important at a conceptual level. Not to mention that it gives you street cred as far as statisticians are concerned!
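You can recover the three-sigma percentages from pnorm: each one is just the area between -k and +k standard deviations on a standard normal.

```r
# area within 1, 2, and 3 standard deviations of the mean
print(pnorm(1) - pnorm(-1))   # roughly 0.68
print(pnorm(2) - pnorm(-2))   # roughly 0.95
print(pnorm(3) - pnorm(-3))   # roughly 0.997

# the rule holds for any normal distribution; for the heights example,
# one standard deviation on either side of the mean is 61.5 to 68.5 inches
print(pnorm(68.5, mean = 65, sd = 3.5) - pnorm(61.5, mean = 65, sd = 3.5))
```

The last line prints the same ~68% as the first, which is precisely why a single standard normal table suffices for every normal distribution.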
Formally, the z-table tells us the values of the cumulative distribution function at different z-scores of a normal distribution. Less abstractly, the z-table tells us the area under the curve from negative infinity to certain z-scores. For example, looking up -1 on a z-table will tell us the area to the left of 1 standard deviation below the mean (15.9%).

Z-tables only describe the cumulative distribution function (area under the curve) of a standard normal distribution—one with a mean of 0 and a standard deviation of 1. However, we can use a z-table on normal distributions with any parameters, µ and σ. All you need to do is convert a value from the original distribution into a z-score. This process is called standardization:

 z = (x - µ) / σ

To use a z-table to find the probability of choosing a US woman at random who is taller than 70 inches, we first have to convert this value into a z-score. To do this, we subtract the mean (65 inches) from 70 and then divide that value by the standard deviation (3.5 inches):

 z = (70 - 65) / 3.5 ≈ 1.43

Then, we find 1.43 on the z-table; on most z-table layouts, this means finding the row labeled 1.4 (the z-score up to the tenths place) and the column ".03" (the value in the hundredths place). The value at this intersection is .9236, which means that the complement (someone taller than 70 inches) is 1 - .9236 = 0.0764. This is, up to rounding, the same answer we got when we used integration and the pnorm function.

Exercises

Practise the following exercises to reinforce the concepts learned in this chapter:

Recall the drug testing at Daisy Girl, Inc. earlier in the chapter. We used .1% as our prior probability that the employee was using the drug. Why should this prior have been even lower? Using a subjective Bayesian interpretation of probability, estimate what the prior should have been, given that the employee was able to hold down a job and no one saw her/him act like an alligator.
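The standardization step is a one-liner in R, and pnorm confirms that the standardized z-score and the original value give the same cumulative probability:

```r
# standardize: z = (x - mu) / sigma
z <- (70 - 65) / 3.5
print(z)  # about 1.43

# CDF of the standard normal at z...
print(pnorm(z))
# ...equals the CDF of the original distribution at 70 inches
print(pnorm(70, mean = 65, sd = 3.5))

# complement: probability of being taller than 70 inches
print(1 - pnorm(z))  # about 0.0766
```

In other words, pnorm with mean and sd arguments is doing the standardization for you behind the scenes.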
Harken back to the example of the coin from Larry the Untrustworthy Knave. We would expect the proportion of heads in a fair coin that is flipped many times to be around 50%. In Larry's coin, the proportion was 2/3, which is unlikely to occur. The probability of 20 heads in 30 flips is about 2.8%. However, find the probability of getting 40 heads in 60 flips. Even though the proportions are the same, why is observing 40 heads in 60 flips so significantly less probable? Understanding the answer to this question is key to understanding sampling theory and inferential data analysis.

Use the binomial distribution and pbinom to calculate the probability of observing 10 or fewer "1"s when rolling a fair 6-sided die 50 times. View rolling a "1" as a success and not rolling a "1" as a failure. What is the value of the parameter p?

Use a z-table to find the probability of choosing a US woman at random who is 60 inches or shorter. Why is this the same probability as choosing one who is 70 inches or taller?

Suppose a trolley is coming down the tracks, and its brakes are not working. It is poised to run over five people who are hanging out on the tracks ahead of it. You are next to a lever that can change the tracks that the trolley is riding on. However, the second set of tracks has one person hanging out on it, too. Is it morally wrong to not pull the lever so that only one person is hurt, rather than five? How would a utilitarian respond? Next, what would Thomas Aquinas say about this? Back up your thesis by appealing to the Doctrine of the Double Effect in Summa Theologica. Also, what would Kant say? Back up your response by appealing to the categorical imperative introduced in the Foundation of the Metaphysic of Morals.

Summary

In this chapter, we took a detour through probability land. You learned some basic laws of probability, about sample spaces, and conditional independence. You also learned how to derive Bayes' Theorem and learned that it provides the recipe for updating hypotheses in the light of new evidence.

We also touched upon the two primary interpretations of probability. In future chapters, we will be employing techniques from both those approaches.
We concluded with an introduction to sampling from distributions and used two—the binomial and the normal distributions—to answer interesting, non-trivial questions about probability.

This chapter laid the important foundation that supports confirmatory data analysis. Making and checking inferences based on data is all about probability and, at this point, we know enough to move on to have a great time testing hypotheses with data!

Chapter 5. Using Data to Reason About the World

In Chapter 4, Probability, we mentioned that the mean height of US females is 65 inches. Now pretend we didn't know this fact—how could we find out what the average height is? We can measure every US female, but that's untenable; we would run out of money, resources, and time before we even finished with a small city!

Inferential statistics gives us the power to answer this question using a very small sample of all US women. We can use the sample to tell us something about the population we drew it from. We can use observed data to make inferences about unobserved data. By the end of this chapter, you too will be able to go out and collect a small amount of data and use it to reason about the world!

Estimating means

In the example that is going to span this entire chapter, we are going to be examining how we would estimate the mean height of all US women using only samples. Specifically, we will be estimating the population mean using the sample mean as an estimator.

I am going to use the vector all.us.women to represent the population. For simplicity's sake, let's say there are only 10,000 US women.

> # setting seed will make random number generation reproducible
> set.seed(1)
> all.us.women <- rnorm(10000, mean=65, sd=3.5)

We have just created a vector of 10,000 normally distributed random variables with the same parameters as our population of interest using the rnorm function. Of course, at this point, we can just call mean on this vector and call it a day—but that's cheating! We are going to see that we can get really, really close to the population mean without actually using the entire population.

Now, let's take a random sample of ten from this population using the sample function and compute the mean:

> our.sample <- sample(all.us.women, 10)
> mean(our.sample)
[1] 64.51365

Hey, not a bad start!
Our sample will, in all likelihood, contain some short people, some normal people, and some tall people. There's a chance that when we choose a sample, we choose one that contains predominately short people, or a disproportionate number of tall people. Because of this, our estimate will not be exactly accurate. However, as we choose more and more people to include in our sample, those chance occurrences—imbalanced proportions of the short and tall—tend to balance each other out.

Note that as we increase our sample size, the sample mean isn't always closer to the population mean, but it will be closer on average.

We can test that assertion ourselves! Study the following code carefully and try running it yourself.

> population.mean <- mean(all.us.women)
>
> for(sample.size in seq(5, 30, by=5)){
+   # create empty vector with 1000 elements
+   sample.means <- numeric(1000)
+   for(i in 1:1000){
+     sample.means[i] <- mean(sample(all.us.women, sample.size))
+   }
+   distances.from.true.mean <- abs(sample.means - population.mean)
+   mean.distance.from.true.mean <- mean(distances.from.true.mean)
+   print(mean.distance.from.true.mean)
+ }
[1] 1.245492
[1] 0.8653313
[1] 0.7386099
[1] 0.6355692
[1] 0.5458136
[1] 0.5090788

For each sample size from 5 to 30 (going up by 5), we take 1000 different samples from the population, calculate their means, take the differences from the population mean, and average them.

Figure 5.1: Accuracy of sample means as a function of sample size

As you can see, increasing the sample size gets us closer to the population mean. Increasing the sample size also reduces the standard deviation between the means of the samples.

Figure 5.2: The variability of sample means as a function of sample size

Knowing that, with all other things being equal, larger samples are preferable to smaller ones, let's work with a sample size of 40 for right now. We'll take our sample and estimate our population mean as follows:

> our.new.sample <- sample(all.us.women, 40)
> mean(our.new.sample)
[1] 65.19704

The sampling distribution

So, we have estimated that the true population mean is about 65.2; we know the population mean isn't exactly 65.19704—but by just how much might our estimate be off?
To answer this question, let's take repeated samples from the population again. This time, we're going to take samples of size 40 from the population 10,000 times and plot a frequency distribution of the means.

> means.of.our.samples <- numeric(10000)
> for(i in 1:10000){
+   a.sample <- sample(all.us.women, 40)
+   means.of.our.samples[i] <- mean(a.sample)
+ }

Figure 5.3: The sampling distribution of sample means

This frequency distribution is called a sampling distribution. In particular, since we used sample means as the value of interest, this is called the sampling distribution of the sample means (whew!). You can create a sampling distribution of any statistic (median, variance, and so on), but when we refer to sampling distributions throughout this chapter, we will be specifically referring to the sampling distribution of sample means.

Check it out: the sampling distribution looks like a normal distribution—and that's because it is a normal distribution.

For a large enough sample size, the sampling distribution of any population will be approximately normal, with a mean equal to the population mean, µ, and a standard deviation of σ/√N, where N is the sample size and σ is the population standard deviation. This is called the central limit theorem, and it is among the most important theorems in all of statistics.

Look back at the equation. Convince yourself that the sample size is proportional to the narrowness of the sampling distribution by noting that the sample size is in the denominator.

The standard deviation of the sampling distribution tells us how variable the mean of a sample of a certain size can be from sample to sample. It also tells us how much we expect certain samples' means to vary from the true population mean. The standard deviation of the sampling distribution is called the standard error, and we can use it to quantify our uncertainty about our estimate of the population mean.

If the standard error is small, an estimate from one sample is likely to be closer to the true mean (because the sampling distribution is narrow). If our standard error is big, the mean of any one particular sample is likely to be farther away from the true mean, on average.
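The central limit theorem's σ/√N claim is easy to check empirically. The sketch below (my own check, reusing the simulated all.us.women population from earlier in the chapter) compares the theoretical standard error for samples of size 40 against the standard deviation of many simulated sample means:

```r
set.seed(1)
all.us.women <- rnorm(10000, mean = 65, sd = 3.5)

# standard error predicted by the central limit theorem for samples of size 40
theoretical.se <- sd(all.us.women) / sqrt(40)

# empirical standard error: the standard deviation of many sample means
sample.means <- replicate(10000, mean(sample(all.us.women, 40)))
empirical.se <- sd(sample.means)

theoretical.se
empirical.se    # should be very close to theoretical.se
```

The two numbers agree closely, which is exactly what the theorem promises.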
Okay, so I've convinced you that the standard error is a great statistic to use—but how do we get it? Up until now, I've said that you can calculate it by either:

Taking many, many samples from the population and taking the standard deviation of the sample means
Dividing the standard deviation of the population by the square root of the sample size

However, in practice, this isn't good enough: we don't want to take repeated samples from the population for the same reason that we can't measure the heights of all US women (because it would take too long and cost too much). And, in the case of using the population standard deviation to get the standard error—well, we don't know the population standard deviation—if we did, we would have already had to calculate the population mean, and we wouldn't have to be estimating it with sampling!

Ideally, we want to find the standard error using only one sample. Well, it turns out that for sufficiently large samples, using the sample standard deviation, s, in the standard error formula (instead of the population standard deviation, σ) is a good enough approximation. Similarly, the mean of the sampling distribution is equal to the population mean, but we can use our sample's mean as an estimate of that.

Note

To reiterate, for a sample of sufficient size, we can pretend that the sampling distribution of the sample means has a mean equal to the sample's mean and a standard deviation of the sample's standard deviation divided by the square root of the sample size. This standard deviation of the sampling distribution is called the standard error, and it is a very important number for quantifying the uncertainty of our estimation of the population mean from the sample mean.

For a concrete example, let's use our sample of 40, our.new.sample:

> mean(our.new.sample)
[1] 65.19704
> sd(our.new.sample)
[1] 3.588447
> sd(our.new.sample)/sqrt(length(our.new.sample))
[1] 0.5673833

Our sample's mean and standard deviation are 65.2 and 3.59, respectively. The standard error of the mean is 0.567.
This means that the sampling distribution of the sample means would look something like this:

Figure 5.4: Estimated sampling distribution of sample means based on one sample

Interval estimation

Again, we care about the standard error (the standard deviation of the sampling distribution of sample means) because it expresses the degree of uncertainty we have in our estimation. Because of this, it's not uncommon for statisticians to report the standard error along with their estimate.

What's more common, though, is for statisticians to report a range of numbers to describe their estimates; this is called interval estimation. In contrast, when we were just providing the sample mean as our estimate of the population mean, we were engaging in point estimation.

One common approach to interval estimation is to use confidence intervals. A confidence interval gives us a range over which a significant proportion of the sample means would fall when samples are repeatedly drawn from a population and their means are calculated. Concretely, a 95% confidence interval is the range that would contain 95% of the sample means if multiple samples were taken from the same population. 95% confidence intervals are very common, but 90% and 99% confidence intervals aren't rare.

Think about this for a second: if a 95% confidence interval contains 95% of the sample means, that means that the 95% confidence interval covers 95% of the area of the sampling distribution.

Figure 5.5: The 95% confidence interval of our estimate of the sample mean (64.085 to 66.31) covers 95% of the area in our estimated sampling distribution

Okay, so how do we find the bounds of the confidence interval? Think back to the three-sigma rule from the previous chapter on probability. Recall that about 95% of a normal distribution's area is within two standard deviations of the mean. Well, if the bounds of a confidence interval cover 95% of the sampling distribution, then the bounds must be two standard deviations away from the mean on both sides! Since the standard deviation of the distribution of interest (the sampling distribution of sample means) is the standard error, the bounds of the confidence interval are the mean minus 2 times the standard error and the mean plus 2 times the standard error.
In reality, two standard deviations (or two z-scores) away from the mean contain a little bit more than 95% of the area of the distribution. To be more precise, the range between -1.96 z-scores and 1.96 z-scores contains 95% of the area. Therefore, the bounds of a 95% confidence interval are x̄ ± 1.96 × s/√N, where x̄ is the sample mean and s is the sample standard deviation.

In our example, our bounds are:

> err <- sd(our.new.sample)/sqrt(length(our.new.sample))
> mean(our.new.sample) - (1.96*err)
[1] 64.08497
> mean(our.new.sample) + (1.96*err)
[1] 66.30912

How did we get 1.96?

You can get this number yourself by using the qnorm function.

The qnorm function is a little like the opposite of the pnorm function that we saw in the previous chapter. That function started with a p because it gave us a probability—the probability that we would see a value equal to or below it in a normal distribution. The q in qnorm stands for quantile. A quantile, for a given probability, is the value at which the cumulative probability will be equal to that probability.

I know that was confusing! Stated differently, but equivalently, a quantile for a given probability is the value such that if we put it in the pnorm function, we get back that same probability.

> qnorm(.025)
[1] -1.959964
> pnorm(-1.959964)
[1] 0.025

We showed earlier that 95% of the area under a curve of a probability distribution is within 1.9599 z-scores away from the mean. We put .025 in the qnorm function because, if the mean is right smack in the middle of the 95% confidence interval, then there is 2.5% of the area to the left of the bound and 2.5% of the area to the right of the bound. Together, this lower 2.5% and upper 2.5% make up the missing 5% of the area.

Don't feel limited to the 95% confidence interval, though. You can figure out the bounds of a 90% confidence interval using just the same procedure. In an interval that contains 90% of the area of a curve, the bounds are the values for which 5% of the area is to the left and 5% of the area is to the right of the curve (because 5% and 5% make up the missing 10%).

> qnorm(.05)
[1] -1.644854
> qnorm(.95)
[1] 1.644854
> # notice the symmetry?

That means that for this example, the 90% confidence interval is 64.26 to 66.13, or 65.197 ± 0.933.
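To see what the "95%" actually refers to, we can repeatedly sample from the simulated population, build a 95% confidence interval from each sample, and count how often the interval captures the true population mean. This simulation is my own sketch (the variable names are mine), reusing the all.us.women population from earlier:

```r
set.seed(1)
all.us.women <- rnorm(10000, mean = 65, sd = 3.5)
population.mean <- mean(all.us.women)

# build a 95% confidence interval from each of 1000 samples of size 40
# and record whether it captured the population mean
captured <- replicate(1000, {
  s <- sample(all.us.women, 40)
  err <- sd(s) / sqrt(length(s))
  (mean(s) - 1.96 * err) <= population.mean &&
    population.mean <= (mean(s) + 1.96 * err)
})

mean(captured)   # the proportion of intervals containing the true mean; close to .95
```

About 95% of the intervals contain the population mean, which is the sense in which the procedure earns its name.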
Note

A warning about confidence intervals

There are many misconceptions about confidence intervals floating about. The most pervasive is the misconception that a 95% confidence interval represents the interval such that there is a 95% chance that the population mean is in the interval. This is false. Once the bounds are created, it is no longer a question of probability; the population mean is either in there or it's not.

To convince yourself of this, take two samples from the same distribution and create 95% confidence intervals for both of them. They are different, right? Create a few more. How could it be the case that all of these intervals have the same probability of including the population mean?

Using a Bayesian interpretation of probability, it is possible to say that there exist intervals for which we are 95% certain that they encompass the population mean, since Bayesian probability is a measure of our certainty, or degree of belief, in something. This Bayesian response to confidence intervals is called credible intervals, and we will learn about them in Chapter 7, Bayesian Methods. The procedure for their construction is very different to that of the confidence interval.

Smaller samples

Remember when I said that the sampling distribution of sample means is approximately normal for a large enough sample size? This caveat means that for smaller sample sizes (usually considered to be below 30), the sampling distribution of the sample means is not well approximated by a normal distribution. It is, however, well approximated by another distribution: the t-distribution.

Note

A bit of history…

The t-distribution is also known as the Student's t-distribution. It gets its name from the 1908 paper that introduced it, by William Sealy Gosset writing under the pen name Student. Gosset worked as a statistician at the Guinness Brewery and used the t-distribution and the related t-test to study small samples of the quality of the beer's raw constituents. He is thought to have used a pen name at the request of Guinness so that competitors wouldn't know that they were using the t statistic to their advantage.
The t-distribution has two parameters, the mean and the degrees of freedom (or df). For our purposes here, the degrees of freedom is equal to our sample size minus 1. For example, if we have a sample of 10 from some population and the mean is 5, then a t-distribution with parameters mean=5 and df=9 describes the sampling distribution of sample means with that sample size.

The t-distribution looks a lot like the normal distribution at first glance. However, further examination will reveal that the curve is more flat and wide. This wideness accounts for the higher level of uncertainty we have in regard to a smaller sample.

Figure 5.6: The normal distribution, and two t-distributions with different degrees of freedom

Notice that as the sample size (degrees of freedom) increases, the distribution gets narrower. As the sample size gets higher and higher, it gets closer and closer to a normal distribution. By 29 degrees of freedom, it is very close to a normal distribution indeed. This is why 30 is considered a good rule of thumb for what constitutes a good cut-off between large sample sizes and small sample sizes and, thus, when deciding whether to use a normal distribution or a t-distribution as a model for the sampling distribution.

Let's say that we could only afford taking the heights of 15 US women. What, then, is our 95% interval estimation?

> small.sample <- sample(all.us.women, 15)
> mean(small.sample)
[1] 65.51277
> qt(.025, df=14)
[1] -2.144787
> # notice the difference
> qnorm(.025)
[1] -1.959964

Instead of using the qnorm function to get the correct multiplier to the standard error, we want to find the quantile of the t-distribution at .025 (and .975). For this, we use the qt function, which takes a probability and a number of degrees of freedom. Note that the quantile of the t-distribution is larger in magnitude than the quantile of the normal distribution, which will translate to larger confidence interval bounds; again, this reflects the additional uncertainty we have in our estimate due to a smaller sample size.

> err <- sd(small.sample)/sqrt(length(small.sample))
> mean(small.sample) - (2.145*err)
[1] 64.09551
> mean(small.sample) + (2.145*err)
[1] 66.93003

In this case, the bounds of our 95% confidence interval are 64.1 and 66.9.
Exercises

Practise the following exercises to revise the concepts learned in this chapter:

Write a function that takes a vector and returns the 95% confidence interval for that vector. You can return the interval as a vector of length two: the lower bound and the upper bound. Then, parameterize the confidence coefficient by letting the user of your function choose their own confidence level, but keep 95% as the default. Hint: the first line will look like this:

conf.int <- function(data.vector, conf.coeff=.95){

Back when we introduced the central limit theorem, I said that the sampling distribution from any distribution would be approximately normal. Don't take my word for it! Create a population that is uniformly distributed using the runif function and plot a histogram of the sampling distribution using the code from this chapter and the histogram-plotting code from Chapter 2, The Shape of Data. Repeat the process using the beta distribution with parameters (a=0.5, b=0.5). What does the underlying distribution look like? What does the sampling distribution look like?

A formal and rigorous definition of knowledge and what constitutes knowledge is still an open problem in epistemology. Since Plato and his dialogues, a popular definition of knowledge is the Justified True Belief (JTB) account. In this account, an agent can be said to know something, p, if (a) p is true, (b) the agent believes that p is true, and (c) the agent is justified in believing that p is true. In a 1963 paper, Edmund Gettier introduced examples that seem to satisfy these conditions, but appear not to be true cases of knowledge. Read Gettier's paper. Can the JTB account of knowledge be modified to account for Gettier problems? Or should we reject the JTB account of knowledge and start from scratch?

Summary

The central idea of this chapter is that making the leap from sample to population carries a certain amount of uncertainty with it. In order to be good, honest analysts, we need to be able to express and quantify this uncertainty.

The example we chose to illustrate this principle was estimating the population mean from a sample's mean. You learned that the uncertainty associated with inferring the population mean from sample means is modeled by the sampling distribution of the sample means.
The central limit theorem tells us the parameters we can expect of this sampling distribution. You learned that we could use these parameters on their own, or in the construction of confidence intervals, to express our level of uncertainty about our estimate.

I want to congratulate you for getting this far. The topics introduced in this chapter are very often considered the most difficult to grasp in all of introductory data analysis.

Your tenacity will be greatly rewarded, though; we have laid enough of a foundation to be able to get into some real, practical topics. I promise the next chapter is a lot of fun, and it is filled with interesting examples that you can start applying to real-life problems right away!

Chapter 6. Testing Hypotheses

The salt-and-pepper of inferential statistics is estimation and testing hypotheses. In the last chapter, we talked about estimation and making certain inferences about the world. In this chapter, we will be talking about how to test hypotheses on how the world works and evaluate the hypotheses using only sample data.

In the last chapter, I promised that this would be a very practical chapter, and I'm a man of my word; this chapter goes over a broad range of the most popular methods in modern data analysis at a relatively high level. Even so, this chapter might have a little more detail than the lazy and impatient would want. At the same time, it will have way too little detail for what the extremely curious and mathematically inclined want. In fact, some statisticians would have a heart attack at the degree to which I skip over the math involved with these subjects—but I won't tell if you don't!

Nevertheless, certain complicated concepts and math are beyond the scope of this book. The good news is that once you, dear reader, have the general concepts down, it is easy to deepen your knowledge of these techniques and their intricacies—and I advocate that you do so before making any major decisions based on the tests introduced in these chapters.

Null Hypothesis Significance Testing

For better or worse, Null Hypothesis Significance Testing (NHST) is the most popular hypothesis testing framework in modern use. So, even though there are competing approaches that—at least in some cases—are better, you need to know this stuff up and down!
Okay—Null Hypothesis Significance Testing—those are a bunch of big words. What do they mean?

NHST is a lot like being a prosecutor in the United States' or Great Britain's justice system. In these two countries—and a few others—the person being charged is presumed innocent, and the burden of proving the defendant's guilt is placed on the prosecutor. The prosecutor then has to argue that the evidence is inconsistent with the defendant being innocent. Only after it is shown that the extant evidence is unlikely if the person is innocent does the court rule a guilty verdict. If the extant evidence is weak, or is likely to be observed even if the defendant is innocent, then the court rules not guilty. That doesn't mean the defendant is innocent (the defendant may very well be guilty!)—it means that either the defendant was innocent, or there was not sufficient evidence to prove guilt.

With simple NHST, we are testing two competing hypotheses: the null and the alternative hypotheses. The default hypothesis is called the null hypothesis—it is the hypothesis that our observation occurred from chance alone. In the justice system analogy, this is the hypothesis that the defendant is innocent. The alternative hypothesis is the opposite (or complementary) hypothesis; this would be like the prosecutor's hypothesis.

The null hypothesis terminology was introduced by a statistician named R. A. Fisher in regard to the curious case of Muriel Bristol: a woman who claimed that she could discern, just by tasting it, whether milk was added before tea in a teacup or whether the tea was poured before the milk. She is more commonly known as the lady tasting tea.

Her claim was put to the test! The lady tasting tea was given eight cups; four had milk added first, and four had tea added first. Her task was to correctly identify the four cups that had tea added first. The null hypothesis was that she couldn't tell the difference and would choose a random four teacups. The alternative hypothesis is, of course, that she had the ability to discern whether the tea or milk was poured first.
It turned out that she correctly identified the four cups. The chances of randomly choosing the correct four cups are 1 in 70, or about 1.4%. In other words, the chances of that happening under the null hypothesis are 1.4%. Given that it is so very unlikely to have occurred under the null hypothesis, we may choose to reject the null hypothesis. If the null and alternative hypotheses are mutually exclusive and collectively exhaustive, then a rejection of the null hypothesis is tantamount to an acceptance of the alternative hypothesis.

We can't say anything for certain, but we can work with probabilities. In this example, we wanted to prove or disprove the lady tasting tea's claims. We did not try to evaluate the probability that the lady could tell the difference; we assumed that she could not, and then showed that her stellar performance on the assessment was unlikely under that assumption.

So, here's the basic idea behind NHST as we know it so far:

1. Assume the opposite of what you are testing.
2. (Try to) show that the results you receive are unlikely given that assumption.
3. Reject the assumption.

We have heretofore been rather hand-wavy about what constitutes sufficient unlikelihood to reject the null hypothesis and how we determine the probability in the first place. We'll discuss this now.

In order to quantify how likely or unlikely the results we receive are, we need to define a test statistic—some measure of the sample. The sampling distribution of the test statistic will tell us which test statistics are most likely to occur by chance (under the null hypothesis) with repeated trials of the experiment. Once we know what the sampling distribution of the test statistic looks like, we can tell what the probability of getting a result as extreme as we got is. This is called a p-value. If it is equal to or below some pre-specified boundary, called an alpha level (α level), we decide that the null hypothesis is a bad hypothesis and embrace the alternative hypothesis. Largely as a matter of tradition, an alpha level of .05 is used most often, though other levels are occasionally used as well. So, if the observed result would only occur 5% or less of the time (p-value < .05), we consider it a sufficiently unlikely event and reject the null hypothesis. If the .05 cut-off sounds rather arbitrary, it's because it is.
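The 1-in-70 figure comes from counting the number of ways to pick 4 cups out of 8, which is quick to verify in R:

```r
# number of ways to choose 4 cups from 8
choose(8, 4)        # 70

# probability of guessing the correct 4 cups at random
1 / choose(8, 4)    # about 0.014, or 1.4%
```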
So, here's our updated and expanded basic idea behind NHST:

1. Formulate a set of two hypotheses: a null hypothesis (often denoted as H0) and an alternative hypothesis (often denoted H1).
   H0: there is no effect
   H1: there is an effect
2. Compute the test statistic.
3. Given the sampling distribution of the test statistic under the null hypothesis, you can calculate the probability of obtaining a test statistic equal to or more extreme than the one you calculated. This is the p-value. Find it.
4. If the probability of obtaining a test statistic equal to or more extreme than the one you calculated is sufficiently unlikely (equal to or less than your alpha level), then you may reject the null hypothesis.
5. If the null and alternative hypotheses are collectively exhaustive, you may embrace the alternative hypothesis.

The illustrative example that's going to make sense out of all of this is none other than the gambit of Larry the Untrustworthy Knave, whom we met in Chapter 4, Probability. If you recall, Larry, who can only be trusted some of the time, gave us a coin that he alleges is fair. We flip it 30 times and observe 10 heads. Let's hypothesize that the coin is unfair; let's formalize our hypotheses:

H0 (null hypothesis): the probability of obtaining heads on this coin is .5
H1 (alternative hypothesis): the probability of obtaining heads on this coin is not .5

Let's just use the number of heads in our sample as the test statistic. What is the sampling distribution of this test statistic? In other words, if the coin were fair, and you repeated the flipping-30-times experiment many times, what is the relative frequency of observing particular numbers of heads? We've seen it already! It's the binomial distribution. A binomial distribution with parameters n=30 and p=0.5 describes the number of heads we should expect in 30 flips.

Figure 6.1: The sampling distribution of our coin-flip test statistic (the number of heads)

As you can see, the outcome that is the most likely is getting 15 heads (as you might imagine). Can you see what the probability of getting 10 heads is? Fairly unlikely, right?
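You can query the sampling distribution in Figure 6.1 directly with dbinom, which gives the probability of exactly k heads under the null hypothesis:

```r
# probability of exactly 15 heads (the most likely outcome) in 30 fair flips
dbinom(15, size = 30, prob = 0.5)   # about 0.144

# probability of exactly 10 heads -- much less likely
dbinom(10, size = 30, prob = 0.5)   # about 0.028
```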
So, what's the p-value, and is it less than our pre-specified alpha level? Well, we have already worked out the probability of observing 10 or fewer heads in Chapter 4, Probability, as follows:

> pbinom(10, size=30, prob=.5)
[1] 0.04936857

It's less than .05. We can conclude the coin is unfair, right? Well, yes and no. Mostly no. Allow me to explain.

One and two-tailed tests

You may reject the null hypothesis if the test statistic falls within a region under the curve of the sampling distribution that covers 5% of the area (if the alpha level is .05). This is called the critical region. Do you remember, in the last chapter, we constructed 95% confidence intervals that covered 95% of the sampling distribution? Well, the 5% critical region is like the opposite of this. Recall that, in order to cover a symmetric 95% of the area under the curve, we had to start at the .025 quantile and end at the .975 quantile, leaving 2.5% on the left tail and 2.5% on the right tail uncovered.

Similarly, in order for the critical region of a hypothesis test to cover 5% of the most extreme areas under the curve, the area must cover everything to the left of the .025 quantile and everything to the right of the .975 quantile.

So, in order to determine that the 10 heads out of 30 flips is statistically significant, the probability that you would observe 10 or fewer heads has to be less than .025.

There's a function built right into R, called binom.test, which will perform the calculations that we have, until now, been doing by hand. In the most basic incantation of binom.test, the first argument is the number of successes in a series of Bernoulli trials (the number of heads), and the second argument is the number of trials in the sample (the number of coin flips).

> binom.test(10, 30)

	Exact binomial test

data:  10 and 30
number of successes = 10, number of trials = 30, p-value = 0.09874
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.1728742 0.5281200
sample estimates:
probability of success
             0.3333333

If you study the output, you'll see that the p-value does not cross the significance threshold.
Now, suppose that Larry said that the coin was not biased towards tails. To see if Larry was lying, we only want to test the alternative hypothesis that the probability of heads is less than .5. In that case, we would set up our hypotheses like this:

H0: The probability of heads is greater than or equal to .5
H1: The probability of heads is less than .5

This is called a directional hypothesis, because we have a hypothesis that asserts that the deviation from chance goes in a particular direction. In this hypothesis suite, we are only testing whether the observed probability of heads falls into a critical region on only one side of the sampling distribution of the test statistic. The statistical test that we would perform in this case is, therefore, called a one-tailed test—the critical region only lies on one tail. Since the area of the critical region no longer has to be divided between the two tails (like in the two-tailed test we performed earlier), the critical region only contains the area to the left of the .05 quantile.

Figure 6.2: The three panels, from left to right, depict the critical regions of the left ("lesser") one-tailed, two-tailed, and right ("greater") alternative hypotheses. The dashed horizontal line is meant to show that, for the two-tailed tests, the critical region starts below p=.025, because it is being split between two tails. For the one-tailed tests, the critical region is below the dashed horizontal line at p=.05.

As you can see from the figure, for the directional alternative hypothesis that heads has a probability less than .5, 10 heads is now included in the green critical region.

We can use the binom.test function to test this directional hypothesis, too. All we have to do is specify the optional parameter alternative and set its value to "less" (its default is "two.sided" for a two-tailed test).

> binom.test(10, 30, alternative="less")

	Exact binomial test

data:  10 and 30
number of successes = 10, number of trials = 30, p-value = 0.04937
alternative hypothesis: true probability of success is less than 0.5
95 percent confidence interval:
 0.0000000 0.4994387
sample estimates:
probability of success
             0.3333333

If we wanted to test the directional hypothesis that the probability of heads was greater than .5, we would use alternative="greater".
Take note of the fact that the p-value is now less than .05. In fact, it is identical to the probability we got from the pbinom function.

When things go wrong

Certainty is a card rarely used in the deck of a data analyst. Since we make judgments and inferences based on probabilities, mistakes happen. In particular, there are two types of mistakes that are possible in NHST: Type I errors and Type II errors.

A Type I error is when a hypothesis test concludes that there is an effect (rejects the null hypothesis) when, in reality, no such effect exists
A Type II error occurs when we fail to detect a real effect in the world and fail to reject the null hypothesis even though it is false

Check the following table for the errors encountered in the coin example:

Coin type      | Fail to reject the null hypothesis (conclude no detectable effect) | Reject the null hypothesis (conclude that there is an effect)
Coin is fair   | Correct identification                                             | Type I error (false positive)
Coin is unfair | Type II error (false negative)                                     | Correct identification

In the criminal justice system, Type I errors are considered especially heinous. Legal theorist William Blackstone is famous for his quote: it is better that ten guilty persons escape than one innocent suffer. This is why the court instructs jurors (in the United States, at least) to only convict the defendant if the jury believes the defendant is guilty beyond a reasonable doubt. The consequence is that if the jury favors the hypothesis that the defendant is guilty, but only by a little bit, the jury must give the defendant the benefit of the doubt and acquit.

This line of reasoning holds for hypothesis testing as well. Science would be in a sorry state if we accepted alternative hypotheses on rather flimsy evidence willy-nilly; it is better that we err on the side of caution when making claims about the world, even if that means that we make fewer discoveries of honest-to-goodness, real-world phenomena because our statistical tests failed to reach significance.

This sentiment underlies the decision to use an alpha level like .05. An alpha level of .05 means that we will only commit a Type I error (false positive) 5% of the time. If the alpha level were higher, we would make fewer Type II errors, but at the cost of making more Type I errors, which are more dangerous in most circumstances.
There is a similar metric to the alpha level, and it is called the beta level (β level). The beta level is the probability that we would fail to reject the null hypothesis if the alternative hypothesis were true. In other words, it is the probability of making a Type II error.

The complement of the beta level, 1 minus the beta level, is the probability of correctly detecting a true effect if one exists. This is called power. This varies from test to test. Computing the power of a test, a technique called power analysis, is a topic beyond the scope of this book. For our purposes, it will suffice to say that it depends on the type of test being performed, the sample size being used, and on the size of the effect that is being tested (the effect size). Greater effects, like the average difference in height between women and men, are far easier to detect than small effects, like the average difference in the length of earthworms in Carlisle and in Birmingham. Statisticians like to aim for a power of at least 80% (a beta level of .2). A test that doesn't reach this level of power (because of a small sample size or small effect size, and so on) is said to be underpowered.

A warning about significance

It's perhaps regrettable that we use the term significance in relation to null-hypothesis testing. When the term was first used to describe hypothesis tests, the word significance was chosen because it signified something. As I wrote this chapter, I checked the thesaurus for the word significant, and it indicated that synonyms include notable, worthy of attention, and important. This is misleading in that it is not equivalent to its intended, vestigial meaning. One thing that really confuses people is that they think statistical significance is of great importance in and of itself. This is sadly untrue; there are a few ways to achieve statistical significance without discovering anything of significance, in the colloquial sense.

As we'll see later in the chapter, one way to achieve statistical significance without any practical significance is by using a very large sample size. Very small differences, that make little to no difference in the real world, will nevertheless be considered statistically significant if there is a large enough sample size.
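This phenomenon is easy to demonstrate with simulated data. The sketch below is my own example, not from the book, and it borrows the t-test we will meet formally later in the chapter: a difference of 0.01 standard deviations between two groups is practically meaningless, yet with a million observations per group it comes out statistically significant.

```r
set.seed(1)
group.a <- rnorm(1e6, mean = 0,    sd = 1)
group.b <- rnorm(1e6, mean = 0.01, sd = 1)   # a negligible real-world difference

# with n this large, even a tiny difference is "statistically significant"
t.test(group.a, group.b)$p.value   # well below .05
```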
For this reason, many people make the distinction between statistical significance and practical significance or clinical relevance. Many hold the view that hypothesis testing should only be used to answer the questions is there an effect? or is there a discernable difference?, and that the follow-up questions is it important? or does it make a real difference? should be addressed separately. I subscribe to this point of view.

To answer the follow-up questions, many use effect sizes, which, as we know, capture the magnitude of an effect in the real world. We will see an example of determining the effect size in a test later in this chapter.

A warning about p-values

P-values are, by far, the most talked about metric in NHST. P-values are also notorious for lending themselves to misinterpretation. Of the many criticisms of NHST (and there are many, in spite of its ubiquity), the misinterpretation of p-values ranks highly. The following are two of the most common misinterpretations:

1. A p-value is the probability that the null hypothesis is true. This is not the case. Someone misinterpreting the p-value from our first binomial test might conclude that the chances of the coin being fair are around 10%. This is false. The p-value does not tell us the probability of the hypothesis' truth or falsity. In fact, the test assumes that the null hypothesis is correct. It tells us the proportion of trials for which we would receive a result as extreme or more extreme than the one we did if the null hypothesis were correct. I'm ashamed to admit it, but I made this mistake during my first college introductory statistics class. In my final project for the class, after weeks of collecting data, I found my p-value had not passed the barrier of significance; it was something like .07. I asked my professor if, after the fact, I could change my alpha level to .1 so my results would be positive. In my request, I appealed to the fact that it was still more probable than not that my alternative hypothesis was correct; after all, if my p-value was .07, then there was a 93% chance that the alternative hypothesis was correct. He smiled and told me to read the relevant chapter of our text again. I appreciate him for his patience and restraint in not smacking me right in the head for making such a stupid mistake. Don't be like me.
2. A p-value is a measure of the size of an effect. This is also incorrect, but its wrongness is more subtle than the first misconception. In research papers, it is common to attach phrases like highly significant and very highly significant to p-values that are much smaller than .05 (like .01 and .001). It is common to interpret p-values such as these, and statements such as these, as signaling a bigger effect than p-values that are only modestly less than .05. This is a mistake; this is conflating statistical significance with practical significance. In the previous section, we explained that you can achieve significant p-values (sometimes very highly significant ones) for an effect that is, for all intents and purposes, small and unimportant. We will see a very salient example of this later in this chapter.

Testing the mean of one sample

An illustrative and fairly common statistical hypothesis test is the one sample t-test. You use it when you have one sample and you want to test whether that sample likely came from a population by comparing its mean against the known population mean. For this test to work, you have to know the population mean.

In this example, we'll be using R's built-in precip dataset that contains precipitation data from 70 US cities.

> head(precip)
     Mobile      Juneau     Phoenix Little Rock Los Angeles  Sacramento
       67.0        54.7         7.0        48.5        14.0        17.2

Don't be fooled by the fact that there are city names in there; this is a regular old vector, it's just that the elements are labeled. We can directly take the mean of this vector, just like a normal one.

> is.vector(precip)
[1] TRUE
> mean(precip)
[1] 34.88571

Let's pretend that we, somehow, know the mean precipitation of the rest of the world; is the US' precipitation significantly different from the rest of the world's precipitation?

Remember, in the last chapter, I said that the sampling distributions of sample means for sample sizes under 30 were best approximated by using a t-distribution. Well, this test is called a t-test, because in order to decide whether our sample's mean is consistent with the population whose mean we are testing against, we need to see where our mean falls in relation to the sampling distribution of sample means. If this is confusing, reread the relevant section from the previous chapter.
In order to use the t-test in general cases, regardless of the scale, instead of working with the sampling distribution of sample means, we work with the sampling distribution of the t-statistic.

Remember z-scores from Chapter 3, Describing Relationships? The t-statistic is like a z-score in that it is a scale-less measure of distance from some mean. In the case of the t-statistic, though, we divide by the standard error instead of the standard deviation (because the standard deviation of the population is unknown). Since the t-statistic is standardized, any population, with any mean, using any scale, will have a sampling distribution of the t-statistic that is exactly the same (at the same sample size, of course).

The equation to compute the t-statistic is this:

t = (x̄ - µ) / (s / √N)

where x̄ is the sample mean, µ is the population mean, s is the sample's standard deviation, and N is the sample size.

Let's see for ourselves what the sampling distribution of the t-statistic looks like by taking 10,000 samples of size 70 (the same size as our precip dataset) and plotting the results:

# function to compute t-statistic
t.statistic <- function(thesample, thepopulation){
  numerator <- mean(thesample) - mean(thepopulation)
  denominator <- sd(thesample) / sqrt(length(thesample))
  t.stat <- numerator / denominator
  return(t.stat)
}

# make the pretend population normally distributed
# with a mean of 38
population.precipitation <- rnorm(100000, mean=38)

t.stats <- numeric(10000)

for(i in 1:10000){
  a.sample <- sample(population.precipitation, 70)
  t.stats[i] <- t.statistic(a.sample, population.precipitation)
}

# plot
library(ggplot2)
tmpdata <- data.frame(vals=t.stats)
qplot(vals, data=tmpdata, geom="histogram",
      color=I("white"),
      xlab="sampling distribution of t-statistic",
      ylab="frequency")

Figure 6.3: The sampling distribution of the t-statistic

Ah, there's that familiar shape again!

Fortunately, the sampling distribution of the t-statistic is well known, so we don't have to create our own. In fact, the sampling distributions of many test statistics are well known, so we won't be running our own simulations of them anymore. Lucky us!
Okay, so how does our sample's t-statistic compare to the t-distribution? Our t-statistic, using our function from the last code snippet, is:

> t.statistic(precip, population.precipitation)
[1] -1.901225

You can work this out for yourself easily, though.

Figure 6.4: The t-distribution with 69 degrees of freedom. The t-statistic of our sample is shown as the dashed line

Hmm, it looks like a pretty unlikely occurrence to me, but is it statistically significant? First, let's formally define our hypotheses:

H0 = the average (mean) precipitation in the US is equal to the known average precipitation in the rest of the world
H1 = the average (mean) precipitation in the US is different from the known average precipitation in the rest of the world

Then, we prespecify an alpha level of .05, as is customary.

Since our hypothesis is non-directional (we only hypothesize that the precipitation in the US is different from the world's, not less or more), we define our critical region to cover 2.5% of the area on each side of the curve (5% in total).

> qt(.025, df=69)
[1] -1.994945
> # the critical region is less than -1.995 and more than +1.995

What does it look like now?

Figure 6.5: The previous figure with the critical region for the non-directional hypothesis highlighted

Oh, too bad! It looks like our t-statistic falls just outside of the critical region. So, we fail to reject the null hypothesis.

The cruel truth is that if we had, for some reason, hypothesized that the US precipitation was less than the average world precipitation:

H0 = mean US precipitation >= mean world precipitation
H1 = mean US precipitation < mean world precipitation

we would have achieved significance at alpha = .05.

Figure 6.6: Figure 6.4 with the directional critical region highlighted

Of course, we have no reason to think that US precipitation is less or more than the world's average. And to change our hypothesis now would be cheating. You're not a cheater, are you?

Now that we know what we're doing, we won't be manually calculating our test statistics anymore; we'll just be using the test functions that R provides.

Let's use the function that R provides now. The one sample t-test can be performed by the t.test function. In its most basic form, it takes a vector of sample observations as its first argument and the population mean as its second argument.
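As a quick sanity check (my own aside, not part of the book's example), the two-tailed p-value can be computed by hand from the t-statistic with pt, the cumulative distribution function of the t-distribution:

```r
# Hedged sketch: a two-tailed p-value by hand. pt gives the area
# under the t-distribution (df = 69) to the left of our t-statistic;
# doubling it covers both tails
t.stat <- -1.901225
2 * pt(t.stat, df = 69)   # about 0.0615
```

This matches the p-value that t.test reports below.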
> t.test(precip, mu=38)

        One Sample t-test

data:  precip
t = -1.901, df = 69, p-value = 0.06148
alternative hypothesis: true mean is not equal to 38
95 percent confidence interval:
 31.61748 38.15395
sample estimates:
mean of x 
 34.88571 

Among other things, this test tells us that the t-statistic is -1.901 (just like we calculated ourselves), the degrees of freedom were 69 (the sample size minus 1), and the p-value, which is 0.06148. Like our plot with the two-tailed critical regions showed, this p-value is greater than our prespecified alpha level of 0.05. We fail to reject the null hypothesis.

Just for kicks, let's run the one-tailed hypothesis test:

> t.test(precip, mu=38, alternative="less")

        One Sample t-test

data:  precip
t = -1.901, df = 69, p-value = 0.03074
alternative hypothesis: true mean is less than 38
95 percent confidence interval:
     -Inf 37.61708
sample estimates:
mean of x 
 34.88571 

Now our p-value is < .05. C'est la vie.

Note

Note that the R output indicates the alternative hypothesis, which is that the true mean is less than 38; compare this with the last t-test output.

Assumptions of the one sample t-test

There are two main assumptions of the one sample t-test:

The data are sampled from a normal distribution. This actually has more to do with the sampling distribution of sample means being approximately normal than the actual population. As we know, the sampling distribution of sample means for sufficiently large sample sizes will always be normally distributed, even if the population is not. In reality, this assumption can be violated somewhat, and the results will be valid, especially for sample sizes of over 30. We have nothing to worry about here. Usually, people check this assumption by plotting the sample and making sure it's kind-of normal, though there are more formal ways of doing this, which we will see later. If the assumption of normality is in question, we may want to use an alternative test, like a non-parametric test; we'll see some examples at the end of this chapter.
Independence of samples: Had we tested whether the US precipitation likely came from the population of the entire world's precipitation, we would have been violating this assumption. Why? Because we know that the US is a member of the set (it is indeed 'in the world'), so of course it was drawn from that population. This is why we tested whether the US precipitation was on par with the rest of the world's precipitation. In other examples of the one sample t-test, this assumption basically requires that the sample be random.

Testing two means

An even more common hypothesis test is the independent samples t-test. You would use this to check the equality of two samples' means. Concretely, an example of using this test would be if you have an experiment where you are testing to see if a new drug lowers blood pressure. You would give one group a placebo and the other group the real medication. If the mean improvement in blood pressure was significantly greater than the improvement with the placebo, you might infer that the blood pressure medication works. Outside of more academic uses, web companies use this test all the time to test the effectiveness of, for example, different internet ad campaigns; they expose random users to either one of two types of ads and test if one is more effective than the other. In web-business parlance, this is called an A-B test, but that's just business-ese for controlled experiment.

The term independent means that the two samples are separate, and that data from one sample doesn't affect data in the other. For example, if instead of having two different groups in the blood pressure trial, we used the same participants to test both the conditions (randomizing the order in which we administer the placebo and the real medication), we would violate independence.
The dataset we will be using for this is the mtcars dataset that we first met in Chapter 2, The Shape of Data and saw again in Chapter 3, Describing Relationships. Specifically, we are going to test the hypothesis that the mileage is better for manual cars than it is for cars with automatic transmission. Let's compare the means and produce a boxplot:

> mean(mtcars$mpg[mtcars$am==0])
[1] 17.14737
> mean(mtcars$mpg[mtcars$am==1])
[1] 24.39231
>
> mtcars.copy <- mtcars
> # make new column with better labels
> mtcars.copy$transmission <- ifelse(mtcars$am==0,
                                     "auto", "manual")
> mtcars.copy$transmission <- factor(mtcars.copy$transmission)
> qplot(transmission, mpg, data=mtcars.copy,
+       geom="boxplot", fill=transmission) +
+   # no legend
+   guides(fill=FALSE)

Figure 6.7: Boxplot of the miles per gallon ratings for automatic cars and cars with manual transmission

Hmm, looks different… but let's check that hypothesis formally. Our hypotheses are:

H0 = mean of sample 1 - mean of sample 2 >= 0
H1 = mean of sample 1 - mean of sample 2 < 0

To do this, we use the t.test function, too; only this time, we provide two vectors: one for each sample. We also specify our directional hypothesis in the same way:

> automatic.mpgs <- mtcars$mpg[mtcars$am==0]
> manual.mpgs <- mtcars$mpg[mtcars$am==1]
> t.test(automatic.mpgs, manual.mpgs, alternative="less")

        Welch Two Sample t-test

data:  automatic.mpgs and manual.mpgs
t = -3.7671, df = 18.332, p-value = 0.0006868
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
      -Inf -3.913256
sample estimates:
mean of x mean of y 
 17.14737  24.39231 

p < .05. Yippee!

There is an easier way to use the t-test for independent samples that doesn't require us to make two vectors.

> t.test(mpg ~ am, data=mtcars, alternative="less")

This reads, roughly, perform a t-test of the mpg column grouping by the am column in the data frame mtcars. Confirm for yourself that these incantations are equivalent.

Don't be fooled!
Remember when I said that statistical significance was not synonymous with important and that we can use very large sample sizes to achieve statistical significance without any clinical relevance? Check this snippet out:

> set.seed(16)
> t.test(rnorm(1000000, mean=10), rnorm(1000000, mean=10))

        Welch Two Sample t-test

data:  rnorm(1e+06, mean=10) and rnorm(1e+06, mean=10)
t = -2.1466, df = 1999998, p-value = 0.03183
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.0058104638 -0.0002640601
sample estimates:
mean of x mean of y 
 9.997916 10.000954 

Here, two vectors of one million normal deviates each are created with a mean of 10. When we use a t-test on these two vectors, it should indicate that the two vectors' means are not significantly different, right?

Well, we got a p-value of less than .05; why? If you look carefully at the last line of the R output, you might see why; the mean of the first vector is 9.997916, and the mean of the second vector is 10.000954. This tiny difference, a meagre .003, is enough to tip the scale into significant territory. However, I can think of very few applications of statistics where .003 of anything is noteworthy even though it is, technically, statistically significant.

The larger point is that the t-test tests for equality of means, and if the means aren't exactly the same in the population, the t-test will, with enough power, detect this. Not all tiny differences in population means are important, though, so it is important to frame the results of a t-test and the p-value in context.

As mentioned earlier in the chapter, a salient strategy for putting the differences in context is to use an effect size. The effect size commonly used in association with the t-test is Cohen's d. Cohen's d is, conceptually, pretty simple: it is a ratio of the variance explained by the "effect" and the variance in the data itself. Concretely, Cohen's d is the difference in means divided by the sample standard deviation. A high d indicates that there is a big effect (difference in means) relative to the internal variability of the data.
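Before reaching for a package, here is a sketch of my own showing the by-hand calculation for the transmission example, under the common convention that the pooled standard deviation weights each group's variance by its degrees of freedom:

```r
# Hedged sketch: Cohen's d by hand for the mpg example.
# The pooled standard deviation weights each group's variance
# by its degrees of freedom (n - 1)
automatic.mpgs <- mtcars$mpg[mtcars$am == 0]
manual.mpgs   <- mtcars$mpg[mtcars$am == 1]
n1 <- length(automatic.mpgs)
n2 <- length(manual.mpgs)
pooled.var <- ((n1 - 1) * var(automatic.mpgs) +
               (n2 - 1) * var(manual.mpgs)) / (n1 + n2 - 2)
(mean(automatic.mpgs) - mean(manual.mpgs)) / sqrt(pooled.var)
# roughly -1.48: a large effect
```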
I mentioned that to calculate d, you have to divide the difference in means by the sample standard deviation. But which one? Although Cohen's d is conceptually straightforward (even elegant!), it is also sometimes a pain to calculate by hand, because the sample standard deviation from both samples has to be pooled. Fortunately, there's an R package that lets us calculate Cohen's d (and other effect size metrics, to boot) quite easily. Let's use it on the auto vs. manual transmission example:

> install.packages("effsize")
> library(effsize)
> cohen.d(automatic.mpgs, manual.mpgs)

Cohen's d

d estimate: -1.477947 (large)
95 percent confidence interval:
       inf        sup 
-2.3372176 -0.6186766 

Cohen's d is -1.478, which is considered a very large effect size. The cohen.d function even tells you this by using canned interpretations of effect sizes. If you try this with the two million-element vectors from above, the cohen.d function will indicate that the effect was negligible.

Although these canned interpretations were on target these two times, make sure you evaluate your own effect sizes in context.

Assumptions of the independent samples t-test

Homogeneity of variance (or homoscedasticity, a scary-sounding word), in this case, simply means that the variance in the miles per gallon of the automatic cars is the same as the variance in miles per gallon of the manual cars. In reality, this assumption can be violated as long as you use a Welch's t-test, like we did, instead of the Student's t-test. You can still use the Student's t-test with the t.test function by specifying the optional parameter var.equal=TRUE. You can test for this assumption formally using var.test or leveneTest from the car package. If you are sure that the assumption of homoscedasticity is not violated, you may want to do this, because it is a more powerful test (fewer Type II errors). Nevertheless, I usually use Welch's t-test to be on the safe side. Also, always use Welch's test if the two samples' sizes are different.

The sampling distribution of the sample means is approximately normal: Again, with a large enough sample size, it always is. We don't have a terribly large sample size here, but in reality, this formulation of the t-test works even if this assumption is violated a little. We will see alternatives in due time.
Independence: Like I mentioned earlier, since the samples contain completely different cars, we're okay on this front. For tests that, for example, use the same participants for both conditions, you would use a Dependent Samples t-test or Paired Samples t-test, which we will not discuss in this book. If you are interested in running one of these tests after some research, use t.test(<vector1>, <vector2>, paired=TRUE).

Testing more than two means

Another really common situation requires testing whether three or more means are significantly discrepant. We would find ourselves in this situation if we had three experimental conditions in the blood pressure trial: one group gets a placebo, one group gets a low dose of the real medication, and one group gets a high dose of the real medication.

Hmm, for cases like these, why don't we just do a series of t-tests? For example, we can test the directional alternative hypotheses:

The low dose of blood pressure medication lowers BP significantly more than the placebo
The high dose of blood pressure medication lowers BP significantly more than the low dose

Well, it turns out that doing this is pretty dangerous business, and the logic goes like this: if our alpha level is 0.05, then the chances of making a Type I error for one test is 0.05; if we perform two tests, then our chances of making a Type I error jump to .0975 (near 10%). By the time we perform 10 tests at that alpha level, the chances of us having made at least one Type I error are about 40%. This is called the multiple testing problem or multiple comparisons problem.

To circumvent this problem, in the case of testing three or more means, we use a technique called Analysis of Variance, or ANOVA. A significant result from an ANOVA leads to the inference that at least one of the means is significantly discrepant from one of the other means; it does not lend itself to the inference that all the means are significantly different. This is an example of an omnibus test, because it is a global test that doesn't tell you exactly where the differences are, just that there are differences.
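The error-rate inflation from running many tests is easy to check yourself. Assuming the tests are independent, the chance of at least one false positive across k tests, each at alpha = .05, is 1 - (1 - .05)^k:

```r
# The familywise Type I error rate for k independent tests,
# each run at alpha = .05
alpha <- 0.05
1 - (1 - alpha)^2    # two tests: 0.0975
1 - (1 - alpha)^10   # ten tests: about 0.40
```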
You might be wondering why a test of equality of means has a name like Analysis of Variance; it's because it does this by comparing the variance between cases to the variance within cases. The general intuition behind an ANOVA is that the higher the ratio of the variance between the different groups to the variance within the different groups, the less likely it is that the different groups were sampled from the same population. This ratio is called an F ratio.

For our demonstration of the simplest species of ANOVA (the one-way ANOVA), we are going to be using the WeightLoss dataset from the car package. If you don't have the car package, install it.

> library(car)
> head(WeightLoss)
    group wl1 wl2 wl3 se1 se2 se3
1 Control   4   3   3  14  13  15
2 Control   4   4   3  13  14  17
3 Control   4   3   1  17  12  16
4 Control   3   2   1  11  11  12
5 Control   5   3   2  16  15  14
6 Control   6   5   4  17  18  18
>
> table(WeightLoss$group)

Control    Diet  DietEx 
     12      12      10 

The WeightLoss dataset contains pounds lost and self-esteem measurements for three weeks for three different groups: a control group, one group just on a diet, and one group that dieted and exercised. We will be testing the hypothesis that the means of the weight loss at week 2 are not all equal:

H0 = the mean weight loss at week 2 between the control, diet group, and diet and exercise group are equal
H1 = at least two of the means of weight loss at week 2 between the control, diet group, and diet and exercise group are not equal

Before the test, let's check out a boxplot of the means:

> qplot(group, wl2, data=WeightLoss, geom="boxplot", fill=group)

Figure 6.8: Boxplot of weight lost in week 2 of trial for three groups: control, diet, and diet & exercise

Now for the ANOVA…

> the.anova <- aov(wl2 ~ group, data=WeightLoss)
> summary(the.anova)
            Df Sum Sq Mean Sq F value   Pr(>F)    
group        2  45.28  22.641   13.37 6.49e-05 ***
Residuals   31  52.48   1.693                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Oh, snap! The p-value (Pr(>F)) is 6.49e-05, which is .000065 if you haven't read scientific notation yet.
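The F value in that table is just the ratio described above: the between-groups mean square divided by the within-groups (residual) mean square.

```r
# The F ratio from the ANOVA table: between-groups mean square
# over within-groups (residual) mean square
22.641 / 1.693   # about 13.37, matching the F value column
```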
As I said before, this just means that at least one of the comparisons between means was significant; there are four ways that this could occur:

The means of diet and diet and exercise are different
The means of diet and control are different
The means of control and diet and exercise are different
The means of control, diet, and diet and exercise are all different

In order to investigate further, we perform a post-hoc test. Quite often, the post-hoc test that analysts perform is a suite of t-tests comparing each pair of means (pairwise t-tests). But wait, didn't I say that was dangerous business? I did, but it's different now:

We have already performed an honest-to-goodness omnibus test at the alpha level of our choosing. Only after we achieve significance do we perform pairwise t-tests.
We correct for the problem of multiple comparisons.

The easiest multiple comparison correcting procedure to understand is Bonferroni correction. In its simplest version, it simply changes the alpha value by dividing it by the number of tests being performed. It is considered the most conservative of all the multiple comparison correction methods. In fact, many consider it too conservative and I'm inclined to agree. Instead, I suggest using a correcting procedure called Holm-Bonferroni correction. R uses this by default.

> pairwise.t.test(WeightLoss$wl2, as.vector(WeightLoss$group))

        Pairwise comparisons using t tests with pooled SD 

data:  WeightLoss$wl2 and as.vector(WeightLoss$group) 

       Control Diet   
Diet   0.28059 -      
DietEx 7.1e-05 0.00091

P value adjustment method: holm 

This output indicates that the difference in means between the Diet and Diet and exercise groups is p < .001. Additionally, it indicates that the difference between Diet and exercise and Control is p < .0001 (look at the cell where it says 7.1e-05). The p-value of the comparison of just diet and the control is .28, so we fail to reject the hypothesis that they have the same mean.
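If you want to apply these corrections to any set of p-values yourself, base R's p.adjust function implements both procedures; the raw p-values below are made up for illustration:

```r
# Hedged sketch: correcting a made-up vector of raw p-values.
# Bonferroni multiplies every p-value by the number of tests;
# Holm-Bonferroni is stepwise and less conservative
p.raw <- c(0.010, 0.020, 0.040)
p.adjust(p.raw, method = "bonferroni")   # 0.03 0.06 0.12
p.adjust(p.raw, method = "holm")         # 0.03 0.04 0.04
```

Note how Holm leaves the last p-value under .05 where Bonferroni pushes two of them over it.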
Assumptions of ANOVA

The standard one-way ANOVA makes three main assumptions:

The observations are independent
The distribution of the residuals (the distances between the values within the groups to their respective means) is approximately normal
Homogeneity of variance: If you suspect that this assumption is violated, you can use R's oneway.test instead

Testing independence of proportions

Remember the University of California Berkeley dataset that we first saw when discussing the relationship between two categorical variables in Chapter 3, Describing Relationships? Recall that UCB was sued because it appeared as though the admissions department showed preferential treatment to male applicants. Also recall that we used cross-tabulation to compare the proportion of admissions across categories.

If admission rates were, say, 10%, you would expect about one out of every ten applicants to be accepted regardless of gender. If this is the case (that gender has no bearing on the proportion of admits), then gender and admission are independent.

Small deviations from this 10% proportion are, of course, to be expected in the real world and not necessarily indicative of a sexist admissions machine. However, if a test of independence of proportions is significant, it indicates that a deviation as extreme as the one we observed would be very unlikely to occur if the variables were truly independent.

A test statistic that captures divergence from an idealized, perfectly independent cross-tabulation is the chi-squared statistic (the χ² statistic), and its sampling distribution is known as a chi-square distribution. If our chi-square statistic falls into the critical region of the chi-square distribution with the appropriate degrees of freedom, then we reject the hypothesis that gender is an independent factor in admissions.

Let's perform one of these chi-square tests on the whole UCBAdmissions dataset.
> # The chi-square test function takes a cross-tabulation,
> # which UCBAdmissions already is. I am converting it to a
> # data frame and back so that you, dear reader, can learn
> # how to do this with other data that isn't already in
> # cross-tabulation form
> ucba <- as.data.frame(UCBAdmissions)
> head(ucba)
     Admit Gender Dept Freq
1 Admitted   Male    A  512
2 Rejected   Male    A  313
3 Admitted Female    A   89
4 Rejected Female    A   19
5 Admitted   Male    B  353
6 Rejected   Male    B  207
>
> # create cross-tabulation
> cross.tab <- xtabs(Freq ~ Gender + Admit, data=ucba)
>
> chisq.test(cross.tab)

        Pearson's Chi-squared test with Yates' continuity correction

data:  cross.tab
X-squared = 91.6096, df = 1, p-value < 2.2e-16

The proportions are almost certainly not independent (p < .0001). Before you conclude that the admissions department is sexist, remember Simpson's Paradox? If you don't, reread the relevant section in Chapter 3, Describing Relationships.

Since the chi-square independence of proportions test can be (and often is) used to compare a whole mess of proportions, it's sometimes referred to as an omnibus test, just like the ANOVA. It doesn't tell us which proportions are significantly discrepant, only that some proportions are.

What if my assumptions are unfounded?

The t-test and ANOVA are both considered parametric statistical tests. The word parametric is used in different contexts to signal different things but, essentially, it means that these tests make certain assumptions about the parameters of the population distributions from which the samples are drawn. When these assumptions are met (with varying degrees of tolerance to violation), the inferences are accurate, powerful (in the statistical sense), and are usually quick to calculate. When those parametric assumptions are violated, though, parametric tests can often lead to inaccurate results.
We’vespokenabouttwomainassumptionsinthischapter:normalityandhomogeneityof variance.Imentionedthat,eventhoughyoucantestforhomogeneityofvariancewiththe leveneTestfunctionfromthecarpackage,thedefaultt.testinRremovesthis restriction.Ialsomentionedthatyoucouldusetheoneway.testfunctioninlieuofaovif youdon’thavetohavetoadheretothisassumptionwhenperforminganANOVA.Dueto theseaffordances,I’lljustfocusontheassumptionofnormalityfromnowon. Inat-test,theassumptionthatthesampleisanapproximatelynormaldistributioncanbe visuallyverified,toacertainextent.Thenaïvewayistosimplymakeahistogramofthe data.AmoreproperapproachistouseaQQ-plot(quantile-quantileplot).Youcanview aQQ-plotinRbyusingtheqqPlotfunctionfromthecarpackage.Let’suseittoevaluate thenormalityofthemilespergallonvectorinmtcars. >library(car) >qqPlot(mtcars$mpg) Figure6.9:AQQ-plotofthemilepergallonvectorinmtcars AQQ-plotcanactuallybeusedtocompareanysamplefromanytheoreticaldistribution, butitismostoftenassociatedwiththenormaldistribution.Theplotdepictsthequantiles ofthesampleandthequantilesofthenormaldistributionagainsteachother.Ifthesample wereperfectlynormal,thepointswouldfallonthesolidreddiagonalline—itsdivergence fromthislinesignalsadivergencefromnormality.Eventhoughitisclearthatthe quantilesformpgdon’tpreciselycomportwiththequantilesofthenormaldistribution,its divergenceisrelativelyminor. Themostpowerfulmethodforevaluatingadherencetotheassumptionofnormalityisto useastatisticaltest.WearegoingtousetheShapiro-Wilktest,becauseit’smyfavorite, thoughthereareafewothers. >shapiro.test(mtcars$mpg) Shapiro-Wilknormalitytest data:mtcars$mpg W=0.9476,p-value=0.1229 Thisnon-significantresultindicatesthatthedeviationsfromnormalityarenotstatistically significant. 
For ANOVAs, the assumption of normality applies to the residuals, not the actual values of the data. After performing the ANOVA, we can check the normality of the residuals quite easily:

> # I'm repeating the set-up
> library(car)
> the.anova <- aov(wl2 ~ group, data=WeightLoss)
>
> shapiro.test(the.anova$residuals)

        Shapiro-Wilk normality test

data:  the.anova$residuals
W = 0.9694, p-value = 0.4444

We're in the clear!

But what if we do violate our parametric assumptions!? In cases like these, many analysts will fall back on using non-parametric tests.

Many statistical tests, including the t-test and ANOVA, have non-parametric alternatives. The appeal of these tests is, of course, that they are resistant to violations of parametric assumptions; they are robust. The drawback is that these tests are usually less powerful than their parametric counterparts. In other words, they have a somewhat diminished capacity for detecting an effect if there truly is one to detect. For this reason, if you are going to use NHST, you should use the more powerful tests by default, and switch only if your assumptions are violated.

The non-parametric alternative to the independent samples t-test is called the Mann-Whitney U test, though it is also known as the Wilcoxon rank-sum test. As you might expect by now, there is a function to perform this test in R. Let's use it on the auto vs. manual transmission example:

> wilcox.test(automatic.mpgs, manual.mpgs)

        Wilcoxon rank sum test with continuity correction

data:  automatic.mpgs and manual.mpgs
W = 42, p-value = 0.001871
alternative hypothesis: true location shift is not equal to 0

Simple!

The non-parametric alternative to the one-way ANOVA is called the Kruskal-Wallis test. Can you see where I'm going with this?

> kruskal.test(wl2 ~ group, data=WeightLoss)

        Kruskal-Wallis rank sum test

data:  wl2 by group
Kruskal-Wallis chi-squared = 14.7474, df = 2, p-value = 0.0006275

Super!

Exercises

Here are a few exercises for you to practise and revise the concepts learned in this chapter:

Read about data-dredging and p-hacking. Why is it dangerous not to formulate a hypothesis, set an alpha level, and set a sample size before collecting data and analyzing results?
Use the command library(help="datasets") to find a list of datasets that R has already built in. Pick a few interesting ones, and form a hypothesis about each one. Rigorously define your null and alternative hypotheses before you start. Test those hypotheses even if it means learning about other statistical tests.

How might you quantify the effect size of a one-way ANOVA? Look up eta-squared if you get stuck.

In ethics, the doctrine of moral relativism holds that there are no universal moral truths, and that moral judgments are dependent upon one's culture or period in history. How can moral progress (the abolition of slavery, fairer trading practices) be reconciled with a relativistic view of morality? If there is no objective moral paradigm, how can criticisms be lodged against the current views of morality? Why replace existing moral judgments with others if there is no standard to which to compare them and, therefore, no reason to prefer one over the other?

Summary

We covered huge ground in this chapter. By now, you should be up to speed on some of the most common statistical tests. More importantly, you should have a solid grasp of the theory behind NHST and why it works. This knowledge is far more valuable than mechanically memorizing a list of statistical tests and clues for when to use each.

You learned that NHST has its origin in testing whether a weird lady's claims about tasting tea were true or not. The general procedure for NHST is to define your null and alternative hypotheses, define and calculate your test statistic, determine the shape and parameters of the sampling distribution of that test statistic, measure the probability that you would observe a test statistic as or more extreme than the one we observed (this is the p-value), and determine whether to reject or fail to reject the null hypothesis based on whether the p-value was below or above the alpha level.

You then learned about one vs. two-tailed tests, Type I and Type II errors, and got some warnings about terminology and common NHST misconceptions.
Then, you learned a litany of statistical tests; we saw that the one sample t-test is used in scenarios where we want to determine if a sample's mean is significantly discrepant from some known population mean; we saw that independent samples t-tests are used to compare the means of two distinct samples against each other; we saw that we use one-way ANOVAs for testing multiple means, why it's inappropriate to just perform a bunch of t-tests, and some methods of controlling Type I error rate inflation. Finally, you learned how the chi-square test is used to check the independence of proportions.

We then directly applied what you learned to real, fun data and tested real, fun hypotheses. They were fun… right!?

Lastly, we discussed parametric assumptions, how to verify that they were met, and one option for circumventing their violation at the cost of power: non-parametric tests. We learned that the non-parametric alternative to the independent samples t-test is available in R as wilcox.test, and the non-parametric alternative to the one-way ANOVA is available in R using the kruskal.test function.

In the next chapter, we will also be discussing mechanisms for testing hypotheses, but this time, we will be using an attractive alternative to NHST based on the famous theorem by Reverend Thomas Bayes that you learned about in Chapter 4, Probability. You'll see how this other method of inference addresses some of the shortcomings (deserved or not) of NHST, and why it's gaining popularity in modern applied data analysis. See you there!

Chapter 7. Bayesian Methods

Suppose I claim that I have a pair of magic rainbow socks. I allege that whenever I wear these special socks, I gain the ability to predict the outcome of coin tosses, using fair coins, better than chance would dictate. Putting my claim to the test, you toss a coin 30 times, and I correctly predict the outcome 20 times. Using a directional hypothesis with the binomial test, the null hypothesis would be rejected at alpha-level 0.05. Would you invest in my special socks?
Whynot?Ifit’sbecauseyourequirealargerburdenofproofonabsurdclaims,Idon’t blameyou.AsagrandparentofBayesiananalysisPierre-SimonLaplace(who independentlydiscoveredthetheoremthatbearsThomasBayes’name)oncesaid:The weightofevidenceforanextraordinaryclaimmustbeproportionedtoitsstrangeness. Ourpriorbelief—myabsurdhypothesis—issosmallthatitwouldtakemuchstronger evidencetoconvincetheskepticalinvestor,letalonethescientificcommunity. Unfortunately,ifyou’dliketoeasilyincorporateyourpriorbeliefsintoNHST,you’reout ofluck.Orsupposeyouneedtoassesstheprobabilityofthenullhypothesis;you’reoutof luckthere,too;NHSTassumesthenullhypothesisandcan’tmakeclaimsaboutthe probabilitythataparticularhypothesisistrue.Incaseslikethese(andingeneral),you maywanttouseBayesianmethodsinsteadoffrequentistmethods.Thischapterwilltell youhow.Joinme! ThebigideabehindBayesiananalysis IfyourecallfromChapter4,Probability,theBayesianinterpretationofprobabilityviews probabilityasourdegreeofbeliefinaclaimorhypothesis,andBayesianinferencetellsus howtoupdatethatbeliefinthelightofnewevidence.Inthatchapter,weusedBayesian inferencetodeterminetheprobabilitythatemployeesofDaisyGirl,Inc.wereusingan illegaldrug.Wesawhowtheincorporationofpriorbeliefssavedtwoemployeesfrom beingfalselyaccusedandhelpedanotheremployeegetthehelpsheneededeventhough herdrugscreenwasfalselynegative. Inageneralsense,Bayesianmethodstellushowtodoleoutcredibilitytodifferent hypotheses,givenpriorbeliefinthosehypothesesandnewevidence.Inthedrugexample, thehypothesissuitewasdiscrete:druguserornotdruguser.Morecommonly,though, whenweperformBayesiananalysis,ourhypothesisconcernsacontinuousparameter,or manyparameters.Ourposterior(orupdatedbeliefs)wasalsodiscreteinthedrugexample, butBayesiananalysisusuallyyieldsacontinuousposteriorcalledaposteriordistribution. WearegoingtouseBayesiananalysistoputmymagicalrainbowsocksclaimtothetest. 
Our parameter of interest is the proportion of coin tosses that I can correctly predict while wearing the socks; we'll call this parameter θ, or theta. Our goal is to determine what the most likely values of theta are and whether they constitute proof of my claim.

Refer back to the section on Bayes' theorem in Chapter 4, Probability. Recall that the posterior was the prior times the likelihood divided by a normalizing constant. This normalizing constant is often difficult to compute. Luckily, since it doesn't change the shape of the posterior distribution, and since we are comparing relative likelihoods and probability densities, Bayesian methods often ignore this constant. So, all we need is a probability density function to describe our prior belief, and a likelihood function that describes the likelihood that we would get the evidence we received given different parameter values.

The likelihood function is a binomial function, as it describes the behavior of Bernoulli trials; the binomial likelihood function for this evidence is shown in Figure 7.1.

Figure 7.1: The likelihood function of theta for 20 out of 30 successful Bernoulli trials

For different values of theta, there are varying relative likelihoods. Note that the value of theta that corresponds to the maximum of the likelihood function is 0.667, which is the proportion of successful Bernoulli trials. This means that, in the absence of any other information, the most likely proportion of coin flips that my magic socks allow me to predict is 67%. This is called the Maximum Likelihood Estimate (MLE).

So, we have the likelihood function; now we just need to choose a prior. We will be crafting a representation of our prior beliefs using a type of distribution called a beta distribution, for reasons that we'll see very soon.

Since our posterior is a blend of the prior and the likelihood function, it is common for analysts to use a prior that doesn't much influence the results and allows the likelihood function to speak for itself. To this end, one may choose a non-informative prior that assigns equal credibility to all values of theta. This type of non-informative prior is called a flat or uniform prior.
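The likelihood curve in Figure 7.1 is straightforward to reproduce with dbinom. The following is a small sketch (the grid step of 0.001 is an arbitrary choice) that also picks the MLE out of the grid:

```r
# evaluate the binomial likelihood of theta for 20 successes in 30 trials
theta <- seq(0, 1, by = 0.001)
likelihood <- dbinom(20, size = 30, prob = theta)
plot(theta, likelihood, type = "l",
     xlab = "theta", ylab = "likelihood")
# the MLE is the theta value at which the likelihood curve peaks
mle <- theta[which.max(likelihood)]
abline(v = mle, lty = 2)   # dashed line near 20/30 = 0.667
```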
The beta distribution has two hyper-parameters, α (or alpha) and β (or beta). A beta distribution with hyper-parameters α = β = 1 describes such a flat prior. We will call this prior #1.

Note

These are usually referred to as the beta distribution's parameters. We call them hyper-parameters here to distinguish them from our parameter of interest, theta.

Figure 7.2: A flat prior on the value of theta. This beta distribution, with alpha and beta = 1, confers an equal level of credibility to all possible values of theta, our parameter of interest.

This prior isn't really indicative of our beliefs, is it? Do we really assign as much probability to my socks giving me perfect coin-flip prediction powers as we do to the hypothesis that I'm full of baloney?

The prior that a skeptic might choose in this situation is one that looks more like the one depicted in Figure 7.3, a beta distribution with hyper-parameters alpha = beta = 50. This, rather appropriately, assigns far more credibility to values of theta that are concordant with a universe without magical rainbow socks. As good scientists, though, we have to be open-minded to new possibilities, so this prior doesn't rule out the possibility that the socks give me special powers; the probability is low, but not zero, for extreme values of theta. We will call this prior #2.

Figure 7.3: A skeptic's prior

Before we perform the Bayesian update, I need to explain why I chose to use the beta distribution to describe my priors.

The Bayesian update (getting to the posterior) is performed by multiplying the prior with the likelihood. In the vast majority of applications of Bayesian analysis, we don't know what that posterior looks like, so we have to sample from it many times to get a sense of its shape. We will be doing this later in this chapter.
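Both priors are easy to draw with dbeta; this sketch overlays prior #2 on prior #1 (the line types and the y-axis limit are just cosmetic choices):

```r
# prior #1: the flat Beta(1, 1); prior #2: the skeptic's Beta(50, 50)
curve(dbeta(x, 1, 1), from = 0, to = 1, ylim = c(0, 8),
      xlab = "theta", ylab = "prior belief")
curve(dbeta(x, 50, 50), add = TRUE, lty = 2)
abline(v = .5, lty = 3)   # the skeptic's prior is centered on theta = 0.5
```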
For cases like this, though, where the likelihood is a binomial function, using a beta distribution for our prior guarantees that our posterior will also be in the beta distribution family. This is because the beta distribution is a conjugate prior with respect to a binomial likelihood function. There are many other cases of distributions being self-conjugate with respect to certain likelihood functions, but it doesn't often happen in practice that we find ourselves in a position to use them as easily as we can for this problem. The beta distribution also has the nice property that it is naturally confined to the range from 0 to 1, just like the proportion of coin flips I can correctly predict.

The fact that we know how to compute the posterior from the prior and likelihood by just changing the beta distribution's hyper-parameters makes things really easy in this case. The hyper-parameters of the posterior distribution are:

    new alpha = alpha + number of successes
    new beta  = beta + number of failures

That means the posterior distribution using prior #1 will have hyper-parameters alpha = 1 + 20 and beta = 1 + 10. This is shown in Figure 7.4.

Figure 7.4: The result of the Bayesian update of the evidence and prior #1. The interval depicts the 95% credible interval (the densest 95% of the area under the posterior distribution). This interval overlaps slightly with theta = 0.5.

A common way of summarizing the posterior distribution is with a credible interval. The credible interval on the plot in Figure 7.4 is the 95% credible interval and contains 95% of the densest area under the curve of the posterior distribution.

Do not confuse this with a confidence interval. Though it may look like one, this credible interval is very different from a confidence interval. Since the posterior directly contains information about the probability of our parameter of interest at different values, it is admissible to claim that there is a 95% chance that the correct parameter value is in the credible interval. We could make no such claim with confidence intervals. Please do not mix up the two meanings, or people will laugh you out of town.

Observe that the 95% most likely values for theta contain the theta value 0.5, if only barely. Due to this, one may wish to say that the evidence does not rule out the possibility that I'm full of baloney regarding my magical rainbow socks, but the evidence was suggestive.
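The conjugate update rule above (add the successes to alpha and the failures to beta) is a one-liner in R. The helper function update.beta below is hypothetical (not part of any package), shown just to make the rule concrete for both priors:

```r
# beta-binomial conjugate update:
#   posterior alpha = prior alpha + number of successes
#   posterior beta  = prior beta  + number of failures
update.beta <- function(alpha, beta, successes, failures) {
  c(alpha = alpha + successes, beta = beta + failures)
}
update.beta(1, 1, 20, 10)    # prior #1 becomes Beta(21, 11)
update.beta(50, 50, 20, 10)  # prior #2 becomes Beta(70, 60)
```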
To be clear, the end result of our Bayesian analysis is the posterior distribution depicting the credibility of different values of our parameter. The decision to interpret this as sufficient or insufficient evidence for my outlandish claim is a decision that is separate from the Bayesian analysis proper. In contrast to NHST, the information we glean from Bayesian methods (the entire posterior distribution) is much richer. Another thing that makes Bayesian methods great is that you can make intuitive claims about the probability of hypotheses and parameter values in a way that frequentist NHST does not allow you to do.

What does the posterior using prior #2 look like? It's a beta distribution with alpha = 50 + 20 and beta = 50 + 10:

> curve(dbeta(x, 70, 60),          # plot a beta distribution
+       xlab = "θ",                # name x-axis
+       ylab = "posterior belief", # name y-axis
+       type = "l",                # make smooth line
+       yaxt = 'n')                # remove y axis labels
> abline(v = .5, lty = 2)          # make line at theta = 0.5

Figure 7.5: Posterior distribution of theta using prior #2

Choosing a prior

Notice that the posterior distribution looks a little different depending on what prior you use. The most common criticism lodged against Bayesian methods is that the choice of prior adds an unsavory subjective element to analysis. To a certain extent, the critics are right about the added subjective element, but their allegation that it is unsavory is way off the mark.

To see why, check out Figure 7.6, which shows both posterior distributions (from priors #1 and #2) in the same plot. Notice how priors #1 and #2, two very different priors, produce posteriors, given the evidence, that look more similar to each other than the priors did.

Figure 7.6: The posterior distributions from priors #1 and #2

Now direct your attention to Figure 7.7, which shows the posteriors of both priors if the evidence had included 80 out of 120 correct trials.

Figure 7.7: The posterior distributions from priors #1 and #2 with more evidence

Note that the evidence still contains 67% correct trials, but there is now more evidence.
The posterior distributions are now far more similar. Notice that now both of the posteriors' credible intervals exclude theta = 0.5; with 80 out of 120 trials correctly predicted, even the most obstinate skeptic has to concede that something is going on (though they will probably disagree that the power comes from the socks!).

Take notice also of the fact that the credible intervals, in both posteriors, are now substantially narrower, illustrating more confidence in our estimate.

Finally, imagine the case where I correctly predicted 67% of the trials, but out of 450 total trials. The posteriors derived from this evidence are shown in Figure 7.8.

Figure 7.8: The posterior distributions from priors #1 and #2 with even more evidence

The posterior distributions are looking very similar; indeed, they are becoming identical. Given enough trials (given enough evidence), these posterior distributions will be exactly the same. When there is enough evidence that it, rather than the prior, dominates the posterior, we call this overwhelming the prior.

As long as the prior is reasonable (that is, it doesn't assign a probability of 0 to theoretically plausible parameter values), given enough evidence, everybody's posterior belief will look very similar.

There is nothing unsavory or misleading about an analysis that uses a subjective prior; the analyst just has to disclose what her prior is. You can't just pick a prior willy-nilly; it has to be justifiable to your audience. In most situations, a prior may be informed by prior evidence like scientific studies and can be something that most people can agree on. A more skeptical audience may disagree with the chosen prior, in which case the analysis can be re-run using their prior, just like we did in the magic socks example. It is sometimes okay for people to have different prior beliefs, and it is okay for some people to require a little more evidence in order to be convinced of something.

The belief that frequentist hypothesis testing is more objective, and therefore more correct, is mistaken insofar as it causes all parties to hold the same potentially bad assumptions. The assumptions in Bayesian analysis, on the other hand, are stated clearly from the start, made public, and are auditable.
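The narrowing of the flat-prior credible intervals can be verified analytically with qbeta. This sketch assumes the 450-trial case had 300 correct predictions (67%, as stated):

```r
# 95% credible intervals under the flat Beta(1, 1) prior
# as the evidence accumulates; each is narrower than the last
qbeta(c(.025, .975), 1 + 20, 1 + 10)     # 20 of 30 correct
qbeta(c(.025, .975), 1 + 80, 1 + 40)     # 80 of 120 correct
qbeta(c(.025, .975), 1 + 300, 1 + 150)   # 300 of 450 correct
```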
To recap, there are three situations you can come across. In all of these, it makes sense to use Bayesian methods, if that's your thing:

- You have a lot of evidence, and it makes no real difference which prior any reasonable person uses, because the evidence will overwhelm it.
- You have very little evidence, but have to make an important decision given the evidence. In this case, you'd be foolish to not use all available information to inform your decisions.
- You have a medium amount of evidence, and different posteriors illustrate the updated beliefs from a diverse array of prior beliefs. You may require more evidence to convince the extremely skeptical, but the majority of interested parties will come to the same conclusions.

Who cares about coin flips

Who cares about coin flips? Well, virtually no one. However, (a) coin flips are a great simple application to get the hang of Bayesian analysis, and (b) the kinds of problems that a beta prior and a binomial likelihood function solve go way beyond assessing the fairness of coin flips. We are now going to apply the same technique to a real-life problem that I actually came across in my work.

For my job, I had to create a career recommendation system that asked the user a few questions about their preferences and spat out some careers they may be interested in. After a few hours, I had a working prototype. In order to justify putting more resources into improving the project, I had to prove that I was onto something and that my current recommendations performed better than chance.

In order to test this, we got 40 people together, asked them the questions, and presented them with two sets of recommendations. One was the true set of recommendations that I came up with, and one was a control set: the recommendations of a person who answered the questions randomly. If my set of recommendations performed better than chance would dictate, then I had a good thing going, and could justify spending more time on the project.

Simply performing better than chance is no great feat on its own; I also wanted really good estimates of how much better than chance my initial recommendations were.
For this problem, I broke out my Bayesian toolbox! The parameter of interest is the proportion of the time my recommendations performed better than chance. If .5 and lower were very unlikely values of the parameter, as far as the posterior depicted, then I could conclude that I was onto something.

Even though I had strong suspicions that my recommendations were good, I used a uniform beta prior to preemptively thwart criticisms that my prior biased the conclusions. As for the likelihood function, it is from the same function family we used for the coin flips (just with different parameters).

It turns out that 36 out of the 40 people preferred my recommendations to the random ones (three liked them both the same, and one weirdo liked the random ones better). The posterior distribution, therefore, was a beta distribution with parameters 37 and 5.

> curve(dbeta(x, 37, 5), xlab = "θ",
+       ylab = "posterior belief",
+       type = "l", yaxt = 'n')

Figure 7.9: The posterior distribution of the effectiveness of my recommendations using a uniform prior

Again, the end result of the Bayesian analysis proper is the posterior distribution that illustrates credible values of the parameter. The decision to set an arbitrary threshold for concluding that my recommendations were effective or not is a separate matter.

Let's say that, before the fact, we stated that if .5 or lower were not among the 95% most credible values, we would conclude that my recommendations were effective. How do we know what the credible interval bounds are?

Even though it is relatively straightforward to determine the bounds of the credible interval analytically, doing it ourselves computationally will help us understand how the posterior distribution is summarized in the examples given later in this chapter.

To find the bounds, we will take thousands of samples from a beta distribution with hyper-parameters 37 and 5, and find the quantiles at .025 and .975.
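As a point of comparison for the sampling approach, the analytic bounds come straight from qbeta, and pbeta gives the posterior probability of chance-or-worse performance:

```r
# analytic 95% credible interval bounds for the Beta(37, 5) posterior
qbeta(c(.025, .975), 37, 5)
# posterior probability that theta is at or below chance (0.5)
pbeta(0.5, 37, 5)
```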
> samp <- rbeta(10000, 37, 5)
> quantile(samp, c(.025, .975))
     2.5%     97.5%
0.7674591 0.9597010

Neat! With the previous plot already up, we can add lines to the plot indicating this 95% credible interval, like so:

> # horizontal line
> lines(c(.767, .96), c(0.1, 0.1))
> # tiny vertical left boundary
> lines(c(.767, .767), c(0.15, 0.05))
> # tiny vertical right boundary
> lines(c(.96, .96), c(0.15, 0.05))

If you plot this yourself, you'll see that even the lower bound is far from the decision boundary; it looks like my work was worth it after all!

The technique of sampling from a distribution many, many times to obtain numerical results is known as Monte Carlo simulation.

Enter MCMC – stage left

As mentioned earlier, we started with the coin flip examples because of the ease of determining the posterior distribution analytically, primarily because of the beta distribution's self-conjugacy with respect to the binomial likelihood function.

It turns out that most real-world Bayesian analyses require a more complicated solution. In particular, the hyper-parameters that define the posterior distribution are rarely known. What can be determined is the probability density in the posterior distribution for each parameter value. The easiest way to get a sense of the shape of the posterior is to sample from it many thousands of times. More specifically, we sample from all possible parameter values and record the probability density at each point.

How do we do this? Well, in the case of just one parameter value, it's often computationally tractable to just randomly sample willy-nilly from the space of all possible parameter values. For cases where we are using Bayesian analysis to determine the credible values for two parameters, things get a little more hairy.

The posterior distribution for more than one parameter value is called a joint distribution; in the case of two parameters, it is, more specifically, a bivariate distribution.
One such bivariate distribution can be seen in Figure 7.10.

Figure 7.10: A bivariate normal distribution

To picture what it is like to sample a bivariate posterior, imagine placing a bell jar on top of a piece of graph paper (be careful to make sure Esther Greenwood isn't under there!). We don't know the shape of the bell jar, but we can, for each intersection of the lines in the graph paper, find the height of the bell jar over that exact point. Clearly, the smaller the grid on the graph paper, the higher the resolution of our estimate of the posterior distribution.

Note that in the univariate case, we were sampling from n points; in the bivariate case, we are sampling from n^2 points (n points for each axis). For models with more than two parameters, it is simply intractable to use this random sampling method. Luckily, there's a better option than just randomly sampling the parameter space: Markov Chain Monte Carlo (MCMC).

I think the easiest way to get a sense of what MCMC is, is by likening it to the game hot and cold. In this game, which you may have played as a child, an object is hidden and a searcher is blindfolded and tasked with finding this object. As the searcher wanders around, the other player tells the searcher whether she is hot or cold: hot if she is near the object, cold when she is far from the object. The other player also indicates whether the movement of the searcher is getting her closer to the object (getting warmer) or further from the object (getting cooler).

In this analogy, warm regions are areas where the probability density of the posterior distribution is high, and cool regions are areas where the density is low. Put in this way, random sampling is like the searcher teleporting to random places in the space where the other player hid the object and just recording how hot or cold it is at each point. The guided behavior of the searcher we described before is far more efficient at exploring the areas of interest in the space.

At any one point, the blindfolded searcher has no memory of where she has been before. Her next position only depends on the point she is at currently (and the feedback of the other player). A memory-less transition process whereby the next position depends only upon the current position, and not on any previous positions, is called a Markov chain.
The technique for determining the shape of high-dimensional posterior distributions is therefore called Markov chain Monte Carlo (MCMC), because it uses Markov chains to intelligently sample many times from the posterior distribution (Monte Carlo simulation).

The development of software to perform MCMC on commodity hardware is, for the most part, responsible for a Bayesian renaissance in recent decades. Problems that were, not too long ago, completely intractable can now be solved on even relatively low-powered computers.

There is far more to know about MCMC than we have the space to discuss here. Luckily, we will be using software that abstracts some of these deeper topics away from us. Nevertheless, if you decide to use Bayesian methods in your own analyses (and I hope you do!), I'd strongly recommend consulting resources that can afford to discuss MCMC at a deeper level. There are many such resources, available for free, on the web.

Before we move on to examples using this method, it is important that we bring up one last point: mathematically, an infinitely long MCMC chain will give us a perfect picture of the posterior distribution. Unfortunately, we don't have all the time in the world (universe[?]), and we have to settle for a finite number of MCMC samples. The longer our chains, the more accurate the description of the posterior. As the chains get longer and longer, each new sample provides a smaller and smaller amount of new information (economists call this diminishing marginal returns). There is a point in the MCMC sampling where the description of the posterior becomes sufficiently stable and, for all practical purposes, further sampling is unnecessary. It is at this point that we say the chain has converged. Unfortunately, there is no perfect guarantee that our chain has achieved convergence. Of all the criticisms of using Bayesian methods, this is the most legitimate, but only slightly.

There are really effective heuristics for determining whether a running chain has converged, and we will be using a function that will automatically stop sampling the posterior once it has achieved convergence. Further, convergence can be all but perfectly verified by visual inspection, as we'll see soon.

For the simple models in this chapter, none of this will be a problem, anyway.
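Before handing the sampling off to JAGS, it may help to see how little machinery the core idea requires. The following toy random-walk Metropolis sampler targets the Beta(37, 5) posterior from the career recommendation example; it is purely illustrative (the proposal width of 0.05, the chain length, and the starting value are arbitrary choices), and real analyses should use a dedicated sampler like JAGS:

```r
set.seed(42)
n.samples <- 20000
chain <- numeric(n.samples)
chain[1] <- 0.5                             # arbitrary starting value
for (i in 2:n.samples) {
  current  <- chain[i - 1]
  proposal <- current + rnorm(1, 0, 0.05)   # random-walk proposal
  # accept with probability min(1, ratio of posterior densities);
  # dbeta() is 0 outside (0, 1), so such proposals are always rejected
  ratio <- dbeta(proposal, 37, 5) / dbeta(current, 37, 5)
  chain[i] <- if (runif(1) < ratio) proposal else current
}
quantile(chain, c(.025, .975))
```

Note that the chain's quantiles land close to the bounds we found earlier (around .767 and .96), even though the sampler never computes the normalizing constant.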
Using JAGS and runjags

Although it's a bit silly to break out MCMC for the single-parameter career recommendation analysis that we discussed earlier, applying this method to this simple example will aid in its usage for more complicated models.

In order to get started, you need to install a software program called JAGS, which stands for Just Another Gibbs Sampler (a Gibbs sampler is a type of MCMC sampler). This program is independent of R, but we will be using R packages to communicate with it.

After installing JAGS, you will need to install the R packages rjags, runjags, and modeest. As a reminder, you can install all three with this command:

> install.packages(c("rjags", "runjags", "modeest"))

To make sure everything is installed properly, load the runjags package, and run the function testjags(). My output looks something like this:

> library(runjags)
> testjags()
You are using R version 3.2.1 (2015-06-18) on a unix machine,
with the RStudio GUI
The rjags package is installed
JAGS version 3.4.0 found successfully using the command
'/usr/local/bin/jags'

The first step is to create the model that describes our problem. This model is written in an R-like syntax and stored in a string (character vector) that will get sent to JAGS to interpret. For this problem, we will store the model in a string variable called our.model, and the model looks like this:

our.model <- "
model {
    # likelihood function
    numSuccesses ~ dbinom(successProb, numTrials)
    # prior
    successProb ~ dbeta(1, 1)
    # parameter of interest
    theta <- numSuccesses / numTrials
}"

Note that the JAGS syntax allows for R-style comments, which I included for clarity.

In the first few lines of the model, we are specifying the likelihood function. As we know, the likelihood function can be described with a binomial distribution. The line:

numSuccesses ~ dbinom(successProb, numTrials)

says that the variable numSuccesses is distributed according to the binomial function with hyper-parameters given by the variables successProb and numTrials.

In the next relevant line, we are specifying our choice of the prior distribution. In keeping with our previous choice, this line reads, roughly: the successProb variable (referred to in the previous relevant line) is distributed in accordance with the beta distribution with hyper-parameters 1 and 1.
In the last line, we are specifying that the parameter we are really interested in is the proportion of successes (the number of successes divided by the number of trials). We are calling that theta. Notice that we used the deterministic assignment operator (<-) instead of the distributed according to operator (~) to assign theta.

The next step is to define the successProb and numTrials variables for shipping to JAGS. We do this by stuffing these variables in an R list, as follows:

our.data <- list(
    numTrials = 40,
    successProb = 36/40
)

Great! We are all set to run the MCMC.

> results <- autorun.jags(our.model,
+                         data = our.data,
+                         n.chains = 3,
+                         monitor = c('theta'))

The function that runs the MCMC sampler and automatically stops at convergence is autorun.jags. The first argument is the string specifying the JAGS model. Next, we tell the function where to find the data that JAGS will need. After this, we specify that we want to run 3 independent MCMC chains; this will help guarantee convergence and, if we run them in parallel, drastically cut down on the time we have to wait for our sampling to be done. (To see some of the other options available, as always, you can run ?autorun.jags.) Lastly, we specify that we are interested in the variable 'theta'.

After this is done, we can directly plot the results variable where the results of the MCMC are stored. The output of this command is shown in Figure 7.11.

> plot(results,
+      plot.type = c("histogram", "trace"),
+      layout = c(2, 1))

Figure 7.11: Output plots from the MCMC results. The top is a trace plot of theta values along the chain's length. The bottom is a barplot depicting the relative credibility of different theta values.
The first of these plots is called a trace plot. It shows the sampled values of theta as the chain got longer. The fact that all three chains are overlapping around the same set of values is, at least in this case, a strong guarantee that all three chains have converged. The bottom plot is a barplot that depicts the relative credibility of different values of theta. It is shown here as a barplot, and not a smooth curve, because the binomial likelihood function is discrete. If we want a continuous representation of the posterior distribution, we can extract the sample values from the results and plot them as a density plot with a sufficiently large bandwidth:

> # MCMC samples are stored in the mcmc attribute
> # of the results variable
> results.matrix <- as.matrix(results$mcmc)
>
> # extract the samples for 'theta',
> # the only column, in this case
> theta.samples <- results.matrix[, 'theta']
>
> plot(density(theta.samples, adjust = 5))

And we can add the bounds of the 95% credible interval to the plot as before:

> quantile(theta.samples, c(.025, .975))
 2.5% 97.5%
0.800 0.975
> lines(c(.8, .975), c(0.1, 0.1))
> lines(c(.8, .8), c(0.15, 0.05))
> lines(c(.975, .975), c(0.15, 0.05))

Figure 7.12: Density plot of the posterior distribution. Note that the x-axis here starts at 0.6.

Rest assured that there is only a disagreement between the two credible intervals' bounds in this example because the MCMC could only sample discrete values from the posterior, since the likelihood function is discrete. This will not occur in the other examples in this chapter. Regardless, the two methods seem to be in agreement about the shape of the posterior distribution and the credible values of theta. It is all but certain that my recommendations are better than chance. Go me!

Fitting distributions the Bayesian way

In this next example, we are going to be fitting a normal distribution to the precipitation dataset that we worked with in the previous chapter. We will wrap up with a Bayesian analogue to the one-sample t-test.
The results we want from this analysis are credible values of the true population mean of the precipitation data. Refer back to the previous chapter to recall that the sample mean was 34.89. In addition, we will also be determining credible values of the standard deviation of the precipitation data. Since we are interested in the credible values of two parameters, our posterior distribution is a joint distribution.

Our model will look a little different now:

the.model <- "
model {
    mu ~ dunif(0, 60)        # prior
    stddev ~ dunif(0, 30)    # prior
    tau <- pow(stddev, -2)
    for (i in 1:theLength) {
        samp[i] ~ dnorm(mu, tau)    # likelihood function
    }
}"

This time, we have to set two priors: one for the mean of the Gaussian curve that describes the precipitation data (mu), and one for the standard deviation (stddev). We also have to create a variable called tau that describes the precision (the inverse of the variance) of the curve, because dnorm in JAGS takes the mean and the precision as hyper-parameters (and not the mean and standard deviation, like R). We specify that our prior for the mu parameter is uniformly distributed from 0 inches of rain to 60 inches of rain, far above any reasonable value for the population precipitation mean. We also specify that our prior for the standard deviation is a flat one, from 0 to 30. If this were part of any meaningful analysis, and not just a pedagogical example, our priors would be informed in part by precipitation data from other regions, like the US, or by precipitation data from previous years. JAGS comes chock-full of different families of distributions for expressing different priors.

Next, we specify that the variable samp (which will hold the precipitation data) is distributed normally with unknown parameters mu and tau.

Then, we construct an R list to hold the variables to send to JAGS:

the.data <- list(
    samp = precip,
    theLength = length(precip)
)

Cool, let's run it! On my computer, this takes 5 seconds.

> results <- autorun.jags(the.model,
+                         data = the.data,
+                         n.chains = 3,
+                         # now we care about two parameters
+                         monitor = c('mu', 'stddev'))

Let's plot the results directly, like before, while being careful to plot both the trace plot and histogram for both parameters by increasing the layout argument in the call to the plot function.
> plot(results,
+      plot.type = c("histogram", "trace"),
+      layout = c(2, 2))

Figure 7.13: Output plots from the MCMC result of fitting a normal curve to the built-in precipitation dataset

Figure 7.14 shows the distribution of credible values of the mu parameter without reference to the stddev parameter. This is called a marginal distribution.

Figure 7.14: Marginal distribution of the posterior for parameter 'mu'. The dashed line shows the hypothetical population mean within the 95% credible interval.

Remember that, in the last chapter, we wanted to determine whether the US' mean precipitation was significantly discrepant from the (hypothetical) known population mean precipitation of the rest of the world, 38 inches. If we take any value outside the 95% credible interval to indicate significance, then, just like when we used the NHST t-test, we have to reject the hypothesis that there is significantly more or less rain in the US than in the rest of the world.

Before we move on to the next example, you may be interested in credible values for both the mean and the standard deviation at the same time. A great type of plot for depicting this information is a contour plot, which illustrates the shape of a three-dimensional surface by showing a series of lines of equal height. In Figure 7.15, each line shows the edges of a slice of the posterior distribution with equal probability density.
> results.matrix <- as.matrix(results$mcmc)
>
> library(MASS)
> # we need to make a kernel density
> # estimate of the 3-d surface
> z <- kde2d(results.matrix[, 'mu'],
+            results.matrix[, 'stddev'],
+            n = 50)
>
> plot(results.matrix)
> contour(z, drawlabels = FALSE,
+         nlevels = 11, col = rainbow(11),
+         lwd = 3, add = TRUE)

Figure 7.15: Contour plot of the joint posterior distribution. The purple contour corresponds to the region with the highest probability density.

The purple contours (the innermost contours) show the region of the posterior with the highest probability density. These correspond to the most likely values of our two parameters. As you can see, the most likely values of the parameters for the normal distribution that best describes our present knowledge of US precipitation are a mean of a little less than 35 and a standard deviation of a little less than 14. We can corroborate the results of our visual inspection by directly printing the results variable:

> print(results)
JAGS model summary statistics from 30000 samples (chains = 3; adapt+burnin = 5000):

       Lower95  Median Upper95    Mean      SD   Mode
mu      31.645  34.862  38.181  34.866  1.6639 34.895
stddev  11.669  13.886  16.376  13.967  1.2122 13.773

           MCerr MC%ofSD  SSeff      AC.10   psrf
mu      0.012238     0.7  18484   0.002684 1.0001
stddev 0.0093951     0.8  16649 -0.0053588 1.0001

Total time taken: 5 seconds

This also shows other summary statistics from our MCMC samples and some information about the MCMC process.

The Bayesian independent samples t-test

For our last example in the chapter, we will be performing a sort-of Bayesian analogue to the two-sample t-test, using the same data and problem from the corresponding example in the previous chapter: testing whether the means of the gas mileage for automatic and manual cars are significantly different.

Note

There is another popular Bayesian alternative to NHST, which uses something called Bayes factors to compare the likelihood of the null and alternative hypotheses.
Asbefore,let’sspecifythemodelusingnon-informativeflatpriors: the.model<-" model{ #eachgroupwillhaveaseparatemu #andstandarddeviation for(jin1:2){ mu[j]~dunif(0,60)#prior stddev[j]~dunif(0,20)#prior tau[j]<-pow(stddev[j],-2) } for(iin1:theLength){ #likelihoodfunction y[i]~dnorm(mu[x[i]],tau[x[i]]) } }" Noticethattheconstructthatdescribesthelikelihoodfunctionisalittledifferentnow;we havetousenestedsubscriptsforthemuandtauparameterstotellJAGSthatweare dealingwithtwodifferentversionsofmuandstddev. Next,thedata: the.data<-list( y=mtcars$mpg, #'x'needstostartat1so #1isnowautomaticand2ismanual x=ifelse(mtcars$am==1,1,2), theLength=nrow(mtcars) ) Finally,let’sroll! >results<-autorun.jags(the.model, +data=the.data, +n.chains=3, +monitor=c('mu','stddev')) Let’sextractthesamplesforboth‘mu’sandmakeavectorthatholdsthedifferencesinthe musamplesbetweeneachofthetwogroups. >results.matrix<-as.matrix(results$mcmc) >difference.in.means<-(results.matrix[,1]– +results.matrix[,2]) Figure7.16showsaplotofthecredibledifferencesinmeans.Thelikelydifferencesin meansarefaraboveadifferenceofzero.Weareallbutcertainthatthemeansofthegas mileageforautomaticandmanualcarsaresignificantlydifferent. Figure7.16:Crediblevaluesforthedifferenceinmeansofthegasmileagebetween automaticandmanualcars.Thedashedlineisatadifferenceofzero Noticethatthedecisiontomimictheindependentsamplest-testmadeusfocusonone particularpartoftheBayesiananalysisanddidn’tallowustoappreciatesomeoftheother veryvaluableinformationtheanalysisyielded.Forexample,inadditiontohavinga distributionillustratingcredibledifferencesinmeans,wehavetheposteriordistribution forthecrediblevaluesofboththemeansandstandarddeviationsofbothsamples.The abilitytomakeadecisiononwhetherthesamples’meansaresignificantlydifferentisnice —theabilitytolookattheposteriordistributionoftheparametersisbetter. 
Exercises

Practise the following exercises to reinforce the concepts learned in this chapter:

- Write a function that will take a vector holding MCMC samples for a parameter and plot a density curve depicting the posterior distribution and the 95% credible interval. Be careful of different scales on the y-axis.
- Fitting a normal curve to an empirical distribution is conceptually easy, but not very robust. For distribution fitting that is more robust to outliers, it's common to use a t-distribution instead of the normal distribution, since the t has heavier tails. View the distribution of the shape attribute of the built-in rock dataset. Does this look normally distributed? Find the parameters of a normal curve that is a fit to the data. In JAGS, dt, the t-distribution density function, takes three parameters: the mean, the precision, and the degrees of freedom that control the heaviness of the tails. Find the parameters after fitting a t-distribution to the data. Are the means similar? Which estimate of the mean do you think is more representative of central tendency?
- In Theseus' paradox, a wooden ship belonging to Theseus has decaying boards, which are removed and replaced with new lumber. Eventually, all the boards in the original ship have been replaced, so that the ship is made up of completely new matter. Is it still Theseus' ship? If not, at what point did it become a different ship? What would Aristotle say about this? Appeal to the doctrine of the Four Causes. Would Aristotle's stance still hold up if, as in Thomas Hobbes' version of the paradox, the original decaying boards were saved and used to make a complete replica of Theseus' original ship?

Summary

Although most introductory data analysis texts don't even broach the topic of Bayesian methods, you, dear reader, are versed enough in this matter to start applying these techniques to real problems.

We discovered that Bayesian methods can, at least for the models in this chapter, not only answer the same kinds of questions we might use the binomial test, one-sample t-test, and independent samples t-test for, but provide a much richer and more intuitive depiction of our uncertainty in our estimates.
If these approaches interest you, I urge you to learn more about how to extend these to supersede other NHST tests. I also urge you to learn more about the mathematics behind MCMC.

As with the last chapter, we covered much ground here. If you made it through, congratulations!

This concludes the unit on confirmatory data analysis and inferential statistics. In the next unit, we will be concerned less with estimating parameters, and more interested in prediction. Last one there is a rotten egg!

Chapter 8. Predicting Continuous Variables

Now that we've fully covered introductory inferential statistics, we're going to shift our attention to one of the most exciting and practically useful topics in data analysis: predictive analytics. Throughout this chapter, we are going to introduce concepts and terminology from a closely related field called statistical learning or, as it's (somehow) more commonly referred to, machine learning.

Whereas in the last unit, we were using data to make inferences about the world, this unit is primarily about using data to make inferences (or predictions) about other data. On the surface, this might not sound more appealing, but consider the fruits of this area of study: if you've ever received a call from your credit card company asking to confirm a suspicious purchase that you, in fact, did not make, it's because sophisticated algorithms learned your purchasing behavior and were able to detect deviation from that pattern.

Since this is the first chapter leaving inferential statistics and delving into predictive analytics, it's only natural that we would start with a technique that is used for both ends: linear regression.

At the surface level, linear regression is a method that is used both to predict the values that continuous variables take on, and to make inferences about how certain variables are related to a continuous variable. These two procedures, prediction and inference, foundationally rely on the information from statistical models. Statistical models are idealized representations of a theory meant to illustrate and explain a process that generates data. A model is usually an equation, or series of equations, with some number of parameters.
Throughout this chapter, remember the quote (generally attributed to) George Box:

All models are wrong but some are useful.

A model airplane or car might not be the real thing, but it can help us learn and understand some pretty powerful properties of the object that is being modeled.

Although linear regression is, at a high level, conceptually quite simple, it is absolutely indispensable to modern applied statistics, and a thorough understanding of linear models will pay enormous dividends throughout your career as an analyst.

Linear models

A small baking outfit in upstate New York called No Scone Unturned keeps careful records of the baked goods it produces. The left panel of Figure 8.1 is a scatterplot of diameters and circumferences (in centimeters) of No Scone Unturned's cookies, and depicts their relationship:

Figure 8.1: (left) A scatterplot of diameters and circumferences of No Scone Unturned's cookies; (right) the same plot with a best-fit regression line plotted over the data points

A straight line is the perfect thing to represent this data. After fitting a straight line to the data, we can make predictions about the circumferences of cookies that we haven't observed, like ones with diameters of 11 or 0.7 centimeters (if you weren't playing truant in grade school, you'd know there's a consistent and predictable relationship between the diameter of a circle and the circle's circumference, namely π, but we'll ignore that for now).

You may have learned that the equation that describes a line in a Cartesian plane is:

  y = mx + b

where b is the y-intercept (the place where the line intersects with the vertical line at x = 0), and m is the slope (describing the direction and steepness of the line). In linear regression, the equation describing y as a function of x is written as:

  y = β0 + β1 x

where β0 (sometimes b0) is the y-intercept, and β1 (sometimes b1) is the slope. Collectively, the βs are known as the beta coefficients.

The equation of the line that best describes this data is:

  y = 0 + πx

making β0 and β1 0 and π respectively.

Knowing this, it is easy to predict the circumferences of cookies that we haven't measured yet. The circumference of the cookie with a diameter of 11 centimeters is 0 + 3.1415(11), or 34.558, and a cookie of 0.7 centimeters is 0 + 3.1415(0.7), or 2.2.
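The No Scone Unturned measurements aren't provided, but a fitted line recovering these coefficients is easy to sketch with made-up diameters (the diameters vector here is hypothetical; with noiseless circumferences, lm recovers an intercept of 0 and a slope of π exactly):

```r
# hypothetical cookie diameters; circumferences follow pi * d exactly
diameters <- c(2, 4, 6, 8, 11)
circumferences <- pi * diameters

# fit the line and inspect the beta coefficients
model <- lm(circumferences ~ diameters)
coef(model)   # intercept ~ 0, slope ~ pi

# predicting the unobserved cookies from the text
predict(model, newdata = data.frame(diameters = c(11, 0.7)))
```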
In predictive analytics' parlance, the variable that we are trying to predict is called the dependent (or, sometimes, target) variable, because its values are dependent on other variables. The variables that we use to predict the dependent variable are called independent (or, sometimes, predictor) variables.

Before moving on to a less silly example, it is important to understand the proper interpretation of the slope β1: it describes how much the dependent variable increases (or decreases) for each unit increase of the independent variable. In this case, for every centimeter increase in a cookie's diameter, the circumference increases π centimeters. In contrast, a negative β1 indicates that as the independent variable increases, the dependent variable decreases.

Simple linear regression

On to a substantially less trivial example, let's say No Scone Unturned has been keeping careful records of how many raisins (in grams) they have been using for their famous oatmeal raisin cookies. They want to construct a linear model describing the relationship between the area of a cookie (in centimeters squared) and how many raisins they use, on average.

In particular, they want to use linear regression to predict how many grams of raisins they will need for a 1-meter long oatmeal raisin cookie. Predicting a continuous variable (grams of raisins) from other variables sounds like a job for regression! In particular, when we use just a single predictor variable (the area of the cookies), the technique is called simple linear regression.

The left panel of Figure 8.2 illustrates the relationship between the area of the cookies and the amount of raisins used. It also shows the best-fit regression line:

Figure 8.2: (left) A scatterplot of areas and grams of raisins in No Scone Unturned's cookies with a best-fit regression line; (right) the same plot with highlighted residuals

Note that, in contrast to the last example, virtually none of the data points actually rest on the best-fit line; there are now errors. This is because there is a random component to how many raisins are used.
The right panel of Figure 8.2 draws dashed red lines between each data point and what the best-fit line would predict is the amount of raisins necessary. These dashed lines represent the error in the prediction, and these errors are called residuals.

So far, we haven't discussed how the best-fit line is determined. In essence, the line of the best fit will minimize the amount of dashed line. More specifically, the residuals are squared and all added up; this is called the Residual Sum of Squares (RSS). The line that is the best fit will minimize the RSS. This method is called ordinary least squares, or OLS.

Look at the two plots in Figure 8.3. Notice how the regression lines are drawn in ways that clearly do not minimize the amount of red line. The RSS can be further minimized by increasing the slope in the first plot, and decreasing it in the second plot:

Figure 8.3: Two regression lines that do not minimize the RSS

Now that there are differences between the observed values and the predicted values (as there will be in every real-life linear regression you perform), the equation that describes y, the dependent variable, changes slightly:

  y = β0 + β1 x + ε

The equation without the residual term, ε, only describes our prediction, ŷ, pronounced y hat (because it looks like y is wearing a little hat):

  ŷ = β0 + β1 x

Our error term is, therefore, the difference between the value that our model predicts and the actual empirical value for each observation i:

  εi = yi − ŷi

Formally, the RSS is:

  RSS = Σ (yi − ŷi)²

Recall that this is the term that gets minimized when finding the best-fit line.

If the RSS is the sum of the squared residuals (or error terms), the mean of the squared residuals is known as the Mean Squared Error (MSE), and is a very important measure of the accuracy of a model. Formally, the MSE is:

  MSE = (1/n) Σ (yi − ŷi)²

Occasionally, you will encounter the Root Mean Squared Error (RMSE) as a measure of model fit. This is just the square root of the MSE, putting it in the same units as the dependent variable (instead of units of the dependent variable squared). The difference between the MSE and RMSE is like the difference between variance and standard deviation, respectively. In fact, in both these cases (the MSE/RMSE and variance/standard-deviation), the error terms have to be squared for the very same reason; if they were not, the positive and negative residuals would cancel each other out.
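These quantities are easy to compute from a fitted model's residuals. A quick sketch using mtcars (the data set the text turns to next); the variable names here are mine:

```r
model <- lm(mpg ~ wt, data = mtcars)

rss  <- sum(residuals(model)^2)   # Residual Sum of Squares
mse  <- mean(residuals(model)^2)  # Mean Squared Error
rmse <- sqrt(mse)                 # RMSE, in units of mpg

# equivalently, the RSS computed from the predictions themselves
all.equal(rss, sum((mtcars$mpg - predict(model))^2))
```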
Now that we have a bit of the requisite math, we're ready to perform a simple linear regression ourselves, and interpret the output. We will be using the venerable mtcars data set, and try to predict a car's gas mileage (mpg) with the car's weight (wt). We will also be using R's base graphics system (not ggplot2) in this section, because the visualization of linear models is arguably simpler in base R.

First, let's plot the cars' gas mileage as a function of their weights:

  > plot(mpg ~ wt, data=mtcars)

Here we employ the formula syntax that we were first introduced to in Chapter 3, Describing Relationships, and that we used extensively in Chapter 6, Testing Hypotheses. We will be using it heavily in this chapter as well. As a refresher, mpg ~ wt roughly reads mpg as a function of wt.

Next, let's run a simple linear regression with the lm function, and save it to a variable called model:

  > model <- lm(mpg ~ wt, data=mtcars)

Now that we have the model saved, we can, very simply, add a plot of the linear model to the scatterplot we have already created:

  > abline(model)

Figure 8.4: The result of plotting output from lm

Finally, let's view the result of fitting the linear model using the summary function, and interpret the output:

  > summary(model)

  Call:
  lm(formula = mpg ~ wt, data = mtcars)

  Residuals:
      Min      1Q  Median      3Q     Max
  -4.5432 -2.3647 -0.1252  1.4096  6.8727

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
  wt           -5.3445     0.5591  -9.559 1.29e-10 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 3.046 on 30 degrees of freedom
  Multiple R-squared:  0.7528,  Adjusted R-squared:  0.7446
  F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

The first block of text reminds us how the model was built syntax-wise (which can actually be useful in situations where the lm call is performed dynamically).

Next, we see a five-number summary of the residuals. Remember that this is in units of the dependent variable. In other words, the data point with the highest residual is 6.87 miles per gallon.
In the next block, labeled Coefficients, direct your attention to the two values in the Estimate column; these are the beta coefficients that minimize the RSS. Specifically, β0 = 37.285 and β1 = −5.345. The equation that describes the best-fit linear model, then, is:

  ŷ = 37.285 − 5.345x

Remember, the way to interpret the β1 coefficient is: for every unit increase of the independent variable (it's in units of 1,000 pounds), the dependent variable goes down (because it's negative) 5.345 units (which are miles per gallon). The β0 coefficient indicates, rather nonsensically, that a car that weighs nothing would have a gas mileage of 37.285 miles per gallon. Recall that all models are wrong, but some are useful.

If we wanted to predict the gas mileage of a car that weighed 6,000 pounds, our equation would yield an estimate of 5.218 miles per gallon. Instead of doing the math by hand, we can use the predict function as long as we supply it with a data frame that holds the relevant information for new observations that we want to predict:

  > predict(model, newdata=data.frame(wt=6))
         1
  5.218297

Interestingly, we would predict a car that weighs 7,000 pounds would get −0.126 miles per gallon. Again, all models are wrong, but some are useful. For most reasonable car weights, our very simple model yields reasonable predictions.

If we were only interested in prediction, and only interested in this particular model, we would stop here. But, as I mentioned in this chapter's preface, linear regression is also a tool for inference, and a pretty powerful one at that. In fact, we will soon see that many of the statistical tests we were introduced to in Chapter 6, Testing Hypotheses can be equivalently expressed and performed as a linear model.
When viewing linear regression as a tool of inference, it's important to remember that our coefficients are actually just estimates. The cars observed in mtcars represent just a small sample of all extant cars. If somehow we observed all cars and built a linear model, the beta coefficients would be population coefficients. The coefficients that we asked R to calculate are best guesses based on our sample, and, just like our other estimates in previous chapters, they can undershoot or overshoot the population coefficients, and their accuracy is a function of factors such as the sample size, the representativeness of our sample, and the inherent volatility or noisiness of the system we are trying to model.

As estimates, we can quantify our uncertainty in our beta coefficients using standard error, as introduced in Chapter 5, Using Data to Reason About the World. The column of values directly to the right of the Estimate column, labeled Std. Error, gives us these measures. The estimates of the beta coefficients also have a sampling distribution and, therefore, confidence intervals could be constructed for them.

Finally, because the beta coefficients have well-defined sampling distributions (as long as certain simplifying assumptions hold true), we can perform hypothesis tests on them. The most common hypothesis test performed on beta coefficients asks whether they are significantly discrepant from zero. Semantically, if a beta coefficient is significantly discrepant from zero, it is an indication that the independent variable has a significant impact on the prediction of the dependent variable. Remember the long-running warning in Chapter 6, Testing Hypotheses, though: just because something is significant doesn't mean it is important.

The hypothesis tests comparing the coefficients to zero yield p-values; those p-values are depicted in the final column of the Coefficients section, labeled Pr(>|t|). We usually don't care about the significance of the intercept coefficient (b0), so we can ignore that. Rather importantly, the p-value for the coefficient belonging to the wt variable is near zero, indicating that the weight of a car has some predictive power on the gas mileage of that car.
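The confidence intervals mentioned above don't have to be constructed by hand; R's confint function computes them directly from a fitted model (the 95% level is its default):

```r
model <- lm(mpg ~ wt, data = mtcars)

# 95% confidence intervals for the intercept and the wt coefficient
ci <- confint(model)
ci

# the interval for wt lies entirely below zero, which is consistent
# with the near-zero p-value for wt in the summary output
ci["wt", ]
```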
Getting back to the summary output, direct your attention to the entry called Multiple R-squared. R-squared, also R² or the coefficient of determination, is, like MSE, a measure of how good of a fit the model is. In contrast to the MSE, though, which is in units of the dependent variable, R² is always between 0 and 1, and thus, can be interpreted more easily. For example, if we changed the units of the dependent variable from miles per gallon to miles per liter, the MSE would change, but the R² would not.

An R² of 1 indicates a perfect fit with no residual error, and an R² of 0 indicates the worst possible fit: the independent variable doesn't help predict the dependent variable at all.

Figure 8.5: Linear models (from left to right) with R²s of 0.75, 0.33, and 0.92

Helpfully, the R² is directly interpretable as the amount of variance in the dependent variable that is explained by the independent variable. In this case, for example, the weight of a car explains about 75.3% of the variance of the gas mileage. Whether 75% constitutes a good R² depends heavily on the domain, but in my field (the behavioral sciences), an R² of 75% is really good.

We will have to come back to the rest of the information in the summary output in the section about multiple regression.

Note

Take note of the fact that the p-value of the F-statistic in the last line of the output is the same as the p-value of the t-statistic of the only non-intercept coefficient.

Simple linear regression with a binary predictor

One of the coolest things about linear regression is that we are not limited to using predictor variables that are continuous. For example, in the last section, we used the continuous variable wt (weight) to predict miles per gallon. But linear models are adaptable to using categorical variables, like am (automatic or manual transmission), as well.

Normally, in the simple linear regression equation y = β0 + β1 x, x will hold the actual value of the predictor variable. In the case of a simple linear regression with a binary predictor (like am), x will hold a dummy variable instead. Specifically, when the predictor is automatic, x will be 0, and when the predictor is manual, x will be 1.

More formally:

  ŷ = β0 + β1 x, where x = 0 if the car is automatic and x = 1 if it is manual

Put in this manner, the interpretation of the coefficients changes slightly. Since the β1 x term will be zero when the car is automatic, β0 is the mean miles per gallon for automatic cars.
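Since R² is the proportion of variance explained, it can be recomputed from the RSS and the total sum of squares, confirming the 0.7528 figure from the summary output (a quick sketch; the variable names are mine):

```r
model <- lm(mpg ~ wt, data = mtcars)

rss <- sum(residuals(model)^2)                  # unexplained variation
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation

r.squared <- 1 - rss / tss
r.squared   # matches summary(model)$r.squared
```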
Similarly, since ŷ will equal β0 + β1 when the car is manual, β1 is equal to the mean difference in the gas mileage between automatic and manual cars.

Concretely:

  > model <- lm(mpg ~ am, data=mtcars)
  > summary(model)

  Call:
  lm(formula = mpg ~ am, data = mtcars)

  Residuals:
      Min      1Q  Median      3Q     Max
  -9.3923 -3.0923 -0.2974  3.2439  9.5077

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)   17.147      1.125  15.247 1.13e-15 ***
  am             7.245      1.764   4.106 0.000285 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 4.902 on 30 degrees of freedom
  Multiple R-squared:  0.3598,  Adjusted R-squared:  0.3385
  F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

  > mean(mtcars$mpg[mtcars$am==0])
  [1] 17.14737
  > (mean(mtcars$mpg[mtcars$am==1]) -
  +  mean(mtcars$mpg[mtcars$am==0]))
  [1] 7.244939

The intercept term, β0, is 17.15, which is the mean gas mileage of the automatic cars, and β1 is 7.24, which is the difference of the means between the two groups.

The interpretation of the t-statistic and p-value are very special now; a hypothesis test checking to see if β1 (the difference in group means) is significantly different from zero is tantamount to a hypothesis test of equality of means (the Student's t-test)! Indeed, the t-statistic and p-values are the same:

  # use var.equal to choose Student's t-test
  # over Welch's t-test
  > t.test(mpg ~ am, data=mtcars, var.equal=TRUE)

          Two Sample t-test

  data:  mpg by am
  t = -4.1061, df = 30, p-value = 0.000285
  alternative hypothesis: true difference in means is not equal to 0
  95 percent confidence interval:
   -10.84837  -3.64151
  sample estimates:
  mean in group 0 mean in group 1
         17.14737        24.39231

Isn't that neat!? A two-sample test of equality of means can be equivalently expressed as a linear model! This basic idea can be extended to handle non-binary categorical variables too; we'll see this in the section on multiple regression.

Note that in mtcars, the am column was already coded as 1s (manuals) and 0s (automatics). If automatic cars were dummy coded as 1 and manuals were dummy coded as 0, the results would semantically be the same; the only difference is that β0 would be the mean of manual cars, and β1 would be the (negative) difference in means. The p-values would be the same.
If you are working with a data set that doesn't already have the binary predictor dummy coded, R's lm can handle this too, so long as you wrap the column in a call to factor. For example:

  > mtcars$automatic <- ifelse(mtcars$am==0, "yes", "no")
  > model <- lm(mpg ~ factor(automatic), data=mtcars)
  > model

  Call:
  lm(formula = mpg ~ factor(automatic), data = mtcars)

  Coefficients:
           (Intercept)  factor(automatic)yes
                24.392                -7.245

Finally, note that a car being automatic or manual explains some of the variance in gas mileage, but far less than weight did: this model's R² is only 0.36.

A word of warning

Before we move on, a word of warning: the first part of every regression analysis should be to plot the relevant data. To convince you of this, consider Anscombe's quartet, depicted in Figure 8.6:

Figure 8.6: Four data sets with identical means, standard deviations, regression coefficients, and R²

Anscombe's quartet holds four x-y pairs that have the same mean, standard deviation, correlation coefficients, linear regression coefficients, and R². In spite of these similarities, all four of these data pairs are very different. It is a warning to not blindly apply statistics on data that you haven't visualized. It is also a warning to take linear regression diagnostics (which we will go over before the chapter's end) seriously.

Only two of the x-y pairs in Anscombe's quartet can be modeled with simple linear regression: the ones in the left column. Of particular interest is the one on the bottom left; it looks like it contains an outlier. After thorough investigation into why that datum made it into our data set, if we decide we really should discard it, we can either (a) remove the offending row, or (b) use robust linear regression.
For a more or less drop-in replacement for lm that uses a robust version of OLS called Iteratively Re-Weighted Least Squares (IWLS), you can use the rlm function from the MASS package:

  > library(MASS)
  > data(anscombe)
  > plot(y3 ~ x3, data=anscombe)
  > abline(lm(y3 ~ x3, data=anscombe),
  +        col="blue", lty=2, lwd=2)
  > abline(rlm(y3 ~ x3, data=anscombe),
  +        col="red", lty=1, lwd=2)

Figure 8.7: The difference between linear regression fit with OLS and a robust linear regression fitted with IWLS

Note

OK, one more warning

Some suggest that you should almost always use rlm in favor of lm. It's true that rlm is the bee's knees, but there is a subtle danger in doing this, as illustrated by the following statistical urban legend.

Sometime in 1984, NASA was studying the ozone concentrations from various locations. NASA used robust statistical methods that automatically discarded anomalous data points, believing most of them to be instrument errors or errors in transmission. As a result of this, some extremely low ozone readings in the atmosphere above Antarctica were removed from NASA's atmospheric models. The very next year, British scientists published a paper describing a very deteriorated ozone layer in the Antarctic. Had NASA paid closer attention to the outliers, they would have been the first to discover it.

It turns out that the relevant part of this story is a myth, but the fact that it is so widely believed is a testament to how possible it is.

The point is, outliers should always be investigated and not simply ignored, because they may be indicative of poor model choice, faulty instrumentation, or a gigantic hole in the ozone layer. Once the outliers are accounted for, then use robust methods to your heart's content.

Multiple regression

More often than not, we want to include not just one, but multiple predictors (independent variables) in our predictive models. Luckily, linear regression can easily accommodate us! The technique? Multiple regression.
By giving each predictor its very own beta coefficient in a linear model, the target variable gets informed by a weighted sum of its predictors. For example, a multiple regression using two predictor variables looks like this:

  y = β0 + β1 x1 + β2 x2 + ε

Now, instead of estimating two coefficients (β0 and β1), we are estimating three: the intercept, the slope of the first predictor, and the slope of the second predictor.

Before explaining further, let's perform a multiple regression predicting gas mileage from weight and horsepower:

  > model <- lm(mpg ~ wt + hp, data=mtcars)
  > summary(model)

  Call:
  lm(formula = mpg ~ wt + hp, data = mtcars)

  Residuals:
     Min     1Q Median     3Q    Max
  -3.941 -1.600 -0.182  1.050  5.854

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
  wt          -3.87783    0.63273  -6.129 1.12e-06 ***
  hp          -0.03177    0.00903  -3.519  0.00145 **
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 2.593 on 29 degrees of freedom
  Multiple R-squared:  0.8268,  Adjusted R-squared:  0.8148
  F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

Since we are now dealing with three variables, the predictive model can no longer be visualized with a line; it must be visualized as a plane in 3D space, as seen in Figure 8.8:

Figure 8.8: The prediction region that is formed by a two-predictor linear model is a plane

Aided by the visualization, we can see that our predictions of mpg are informed by both wt and hp. Both of them contribute negatively to the gas mileage. You can see this from the fact that the coefficients are both negative. Visually, we can verify this by noting that the plane slopes downward as wt increases and as hp increases, although the slope for the latter predictor is less dramatic.

Although we lose the ability to easily visualize it, the prediction region formed by a more-than-two predictor linear model is called a hyperplane, and exists in n-dimensional space, where n is the number of predictor variables plus 1.
The astute reader may have noticed that the beta coefficient belonging to the wt variable is not the same as it was in the simple linear regression. The beta coefficient for hp, too, is different than the one estimated using simple regression:

  > coef(lm(mpg ~ wt + hp, data=mtcars))
  (Intercept)          wt          hp
  37.22727012 -3.87783074 -0.03177295
  > coef(lm(mpg ~ wt, data=mtcars))
  (Intercept)          wt
    37.285126   -5.344472
  > coef(lm(mpg ~ hp, data=mtcars))
  (Intercept)          hp
  30.09886054 -0.06822828

The explanation has to do with a subtle difference in how the coefficients should be interpreted now that there is more than one independent variable. The proper interpretation of the coefficient belonging to wt is not that as the weight of the car increases by 1 unit (1,000 pounds), the miles per gallon, on an average, decreases by 3.878 miles per gallon. Instead, the proper interpretation is: Holding horsepower constant, as the weight of the car increases by 1 unit (1,000 pounds), the miles per gallon, on an average, decreases by 3.878 miles per gallon.

Similarly, the correct interpretation of the coefficient belonging to hp is: Holding the weight of the car constant, as the horsepower of the car increases by 1, the miles per gallon, on an average, decreases by 0.032 miles per gallon. Still confused?

It turns out that cars with more horsepower use more gas. It is also true that cars with higher horsepower tend to be heavier. When we put these predictors (weight and horsepower) into a linear model together, the model attempts to tease apart the independent contributions of each of the variables by removing the effects of the other. In multivariate analysis, this is known as controlling for a variable. Hence, the preface to the interpretation can be, equivalently, stated as: Controlling for the effects of the weight of a car, as the horsepower…. Because cars with higher horsepower tend to be heavier, when you remove the effect of horsepower, the influence of weight goes down, and vice versa. This is why the coefficients for these predictors are both smaller than they are in simple single-predictor regression.
In controlled experiments, scientists introduce an experimental condition on two samples that are virtually the same, except for the independent variable being manipulated (for example, giving one group a placebo and one group real medication). If they are careful, they can attribute any observed effect directly to the manipulated independent variable. In simple cases like this, statistical control is often unnecessary. But statistical control is of utmost importance in the other areas of science (especially the behavioral and social sciences) and business, where we are privy only to data from non-controlled natural phenomena.

For example, suppose someone made the claim that gum chewing causes heart disease. To back up this claim, they appealed to data showing that the more someone chews gum, the higher the probability of developing heart disease. The astute skeptic could claim that it's not the gum chewing per se that is causing the heart disease, but the fact that smokers tend to chew gum more often than non-smokers to mask the gross smell of tobacco smoke. If the person who made the original claim went back to the data, and included the number of cigarettes smoked per day as a component of a regression analysis, there would be a coefficient representing the independent influence of gum chewing, and ostensibly, the statistical test of that coefficient's difference from zero would fail to reject the null hypothesis.

In this situation, the number of cigarettes smoked per day is called a confounding variable. The purpose of a carefully designed scientific experiment is to eliminate confounds, but as mentioned earlier, this is often not a luxury available in certain circumstances and domains.
For example, we are so sure that cigarette smoking causes heart disease that it would be unethical to design a controlled experiment in which we take two random samples of people, and ask one group to smoke and one group to just pretend to smoke. Sadly, cigarette companies know this, and they can plausibly claim that it isn't cigarette smoking that causes heart disease, but rather that the kind of people who eventually become cigarette smokers also engage in behaviors that increase the risk of heart disease, like eating red meat and not exercising, and that it's those variables that are making it appear as if smoking is associated with heart disease. Since we can't control for every potential confound that the cigarette companies can dream up, we may never be able to thwart this claim.

Anyhow, back to our two-predictor example: examine the R² value, and how it is different now that we've included horsepower as an additional predictor. Our model now explains more of the variance in gas mileage. As a result, our predictions will, on an average, be more accurate.

Let's predict what the gas mileage of a 2,500 pound car with a horsepower of 275 (horses?) might be:

  > predict(model, newdata=data.frame(wt=2.5, hp=275))
         1
  18.79513

Finally, we can explain the last line of the linear model summary: the one with the F-statistic and associated p-value. The F-statistic measures the ability of the entire model, as a whole, to explain any variance in the dependent variable. Since it has a sampling distribution (the F-distribution) and associated degrees of freedom, it yields a p-value, which can be interpreted as the probability that a model would explain this much (or more) of the variance of the dependent variable if the predictors had no predictive power. The fact that our model has a p-value lower than 0.05 suggests that our model predicts the dependent variable better than chance.

Now we can see why the p-value for the F-statistic in the simple linear regression was the same as the p-value of the t-statistic for the only non-intercept predictor: the tests were equivalent because there was only one source of predictive capability.
We can also see now why the p-value associated with our F-statistic in the multiple regression analysis output earlier is far lower than the p-values of the t-statistics of the individual predictors: the latter only captures the predictive power of each (one) predictor, while the former captures the predictive power of the model as a whole (all two).

Regression with a non-binary predictor

Back in a previous section, I promised that the same dummy-coding method that we used to regress binary categorical variables could be adapted to handle categorical variables with more than two values. For an example of this, we are going to use the same WeightLoss data set as we did in Chapter 6, Testing Hypotheses, to illustrate ANOVA.

To review, the WeightLoss data set contains pounds lost and self-esteem measurements for three weeks for three different groups: a control group, one group just on a diet, and one group that dieted and exercised. We will be trying to predict the amount of weight lost in week 2 by the group the participant was in.

Instead of just having one dummy-coded predictor, we now need two. Specifically:

  x1 = 1 if the participant was in the diet-only group, and 0 otherwise
  x2 = 1 if the participant was in the diet-and-exercise group, and 0 otherwise

Consequently, the equation describing our predictive model is:

  ŷ = β0 + β1 x1 + β2 x2

meaning that β0 is the mean of weight lost in the control group, β1 is the difference in the weight lost between the control and diet-only groups, and β2 is the difference in the weight lost between the control and the diet-and-exercise groups.
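You can see exactly this dummy coding by inspecting the model matrix R constructs for a factor. A quick sketch with a made-up three-level grouping that mirrors the WeightLoss groups (made up because WeightLoss lives in the car package, which may not be installed):

```r
# hypothetical three-level grouping, mirroring the WeightLoss groups
group <- factor(c("Control", "Control", "Diet", "Diet",
                  "DietEx", "DietEx"))

# the intercept column plus one dummy column per non-baseline level;
# the first level alphabetically ("Control") is the baseline
model.matrix(~ group)
```

The two dummy columns, groupDiet and groupDietEx, are the x1 and x2 of the equation above.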
  > # the data set is in the car package
  > library(car)
  > model <- lm(wl2 ~ factor(group), data=WeightLoss)
  > summary(model)

  Call:
  lm(formula = wl2 ~ factor(group), data = WeightLoss)

  Residuals:
     Min     1Q Median     3Q    Max
  -2.100 -1.054 -0.100  0.900  2.900

  Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
  (Intercept)           3.3333     0.3756   8.874 5.12e-10 ***
  factor(group)Diet     0.5833     0.5312   1.098    0.281
  factor(group)DietEx   2.7667     0.5571   4.966 2.37e-05 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 1.301 on 31 degrees of freedom
  Multiple R-squared:  0.4632,  Adjusted R-squared:  0.4285
  F-statistic: 13.37 on 2 and 31 DF,  p-value: 6.494e-05

As before, the p-values associated with the t-statistics are directly interpretable as a t-test of equality of means with the weight lost by the control group. Observe that the p-value associated with the t-statistic of the factor(group)Diet coefficient is not significant. This comports with the results from the pairwise t-test from Chapter 6, Testing Hypotheses.

Most magnificently, compare the F-statistic and the associated p-value in the preceding code with the one in the aov ANOVA from Chapter 6, Testing Hypotheses. They are the same! The F-test of a linear model with a non-binary categorical variable predictor is the same as an NHST analysis of variance!

Kitchen sink regression

When the goal of using regression is simply predictive modeling, we often don't care about which particular predictors go into our model, so long as the final model yields the best possible predictions.

A naïve (and awful) approach is to use all the independent variables available to try to model the dependent variable. Let's try this approach by trying to predict mpg from every other variable in the mtcars data set:

  > # the period after the squiggly denotes all other variables
  > model <- lm(mpg ~ ., data=mtcars)
  > summary(model)

  Call:
  lm(formula = mpg ~ ., data = mtcars)

  Residuals:
      Min      1Q  Median      3Q     Max
  -3.4506 -1.6044 -0.1196  1.2193  4.6271

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept) 12.30337   18.71788   0.657   0.5181
  cyl         -0.11144    1.04502  -0.107   0.9161
  disp         0.01334    0.01786   0.747   0.4635
  hp          -0.02148    0.02177  -0.987   0.3350
  drat         0.78711    1.63537   0.481   0.6353
  wt          -3.71530    1.89441  -1.961   0.0633 .
  qsec         0.82104    0.73084   1.123   0.2739
  vs           0.31776    2.10451   0.151   0.8814
  am           2.52023    2.05665   1.225   0.2340
  gear         0.65541    1.49326   0.439   0.6652
  carb        -0.19942    0.82875  -0.241   0.8122
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 2.65 on 21 degrees of freedom
  Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066
  F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

Hey, check out our R-squared value! It looks like our model explains 87% of the variance in the dependent variable. This is really good; it's certainly better than our simple regression models that used weight (wt) and transmission (am), with their respective R-squared values of 0.753 and 0.36.

Maybe there's something to just including everything we have in our linear models. In fact, if our only goal is to maximize our R-squared, you can always achieve this by throwing every variable you have into the mix, since the introduction of each marginal variable can only increase the amount of variance explained. Even if a newly introduced variable has absolutely no predictive power, the worst it can do is not help explain any variance in the dependent variable; it can never make the model explain less variance.

This approach to regression analysis is often (non-affectionately) called kitchen-sink regression, and is akin to throwing all of your variables against a wall to see what sticks. If you have a hunch that this approach to predictive modeling is crummy, your instinct is correct on this one.

To develop your intuition about why this approach backfires, consider building a linear model to predict a variable of only 32 observations using 200 explanatory variables, which are uniformly and randomly distributed. Just by random chance, there will very likely be some variables that correlate strongly to the dependent variable. A linear regression that includes some of these lucky variables will yield a model that is surprisingly (sometimes astoundingly) predictive.
Remember that when we are creating predictive models, we rarely (if ever) care about how well we can predict the data we already have. The whole point of predictive analytics is to be able to predict the behavior of data we don't have. For example, memorizing the answer key to last year's Social Studies final won't help you on this year's final, if the questions are changed; it'll only prove you can get an A+ on last year's test.

Imagine generating a new random data set of 200 explanatory variables and one dependent variable, and using the coefficients from the linear model of the first random data set to predict the new dependent variable. How well do you think the model will perform?

The model will, of course, perform very poorly, because the coefficients in the model were informed solely by random noise. The model captured chance patterns in the data that it was built with and not a larger, more general pattern; mostly because there was no larger pattern to model!

In statistical learning parlance, this phenomenon is called overfitting, and it happens often when there are many predictors in a model. It is particularly frequent when the number of observations is less than (or not very much larger than) the number of predictor variables (like in mtcars), because there is a greater probability for the many predictors to have a spurious relationship with the dependent variable.

This general occurrence, a model performing well on the data it was built with but poorly on subsequent data, illustrates perfectly perhaps the most common complication with statistical learning and predictive analytics: the bias-variance tradeoff.
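The thought experiment above is easy to run for yourself. This sketch (my own construction, not from the text) fits 20 predictors of pure noise to 32 observations and reports how much variance the model appears to explain:

```r
set.seed(2)
n <- 32

# a dependent variable and 20 predictors, all pure random noise
noise <- data.frame(y = runif(n), matrix(runif(n * 20), nrow = n))

model <- lm(y ~ ., data = noise)

# an alarmingly high R-squared, despite there being nothing to model
summary(model)$r.squared
```

With predictors that carry no information at all, the expected R² is roughly the number of predictors divided by n − 1, so "explained variance" here is entirely an artifact of fitting noise.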
The variance of a model refers to how sensitive a model is to changes in the data that built the model. A model with low variance would change very little when built with new data. A linear model with high variance is very sensitive to changes to the data that it was built with, and the estimated coefficients will be unstable.

The term bias-variance trade-off illustrates that it is easy to decrease bias at the expense of increasing variance, and vice versa. Good models will try to minimize both.

Figure 8.9 depicts two extremes of the bias-variance trade-off. The left-most model depicts a complicated and highly convoluted model that passes through all the data points. This model has essentially no bias, as it has no error when predicting the data that it was built with. However, the model is clearly picking up on random noise in the dataset, and if the model were used to predict new data, there would be significant error. If the same general model were rebuilt with new data, the model would change significantly (high variance). As a result, the model is not generalizable to new data. Models like this suffer from overfitting, which often occurs when overly complicated or overly flexible models are fitted to data—especially when sample size is lacking.

In contrast, the model on the right panel of Figure 8.9 is a simple model (the simplest, actually). It is just a horizontal line at the mean of the dependent variable, mpg. This does a pretty terrible job modeling the variance in the dependent variable, and exhibits high bias. This model does have one attractive property though—the model will barely change at all if fit to new data; the horizontal line will just move up or down slightly based on the mean of the mpg column of the new data.

To demonstrate that our kitchen-sink regression puts us on the wrong side of the optimal point in the bias-variance trade-off, we will use a model validation and assessment technique called cross-validation.

Cross-validation

Given that the goal of predictive analytics is to build generalizable models that predict well for data yet unobserved, we should ideally be testing our models on data unseen, and check our predictions against the observed outcomes. The problem with that, of course, is that we don't know the outcomes of data unseen—that's why we want a predictive model.
We do, however, have a trick up our sleeve, called the validation set approach.

The validation set approach is a technique to evaluate a model's ability to perform well on an independent dataset. But instead of waiting to get our hands on a completely new dataset, we simulate a new dataset with the one we already have.

The main idea is that we can split our dataset into two subsets; one of these subsets (called the training set) is used to fit our model, and then the other (the testing set) is used to test the accuracy of that model. Since the model was built before ever touching the testing set, the testing set serves as an independent source of prediction accuracy estimates, unbiased by the model's precision attributable to its modeling of idiosyncratic noise.

To get at our predictive accuracy by performing our own validation set approach, let's use the sample function to divide the row indices of mtcars into two equal groups, create the subsets, and train a model on the training set:

> set.seed(1)
> train.indices <- sample(1:nrow(mtcars), nrow(mtcars)/2)
> training <- mtcars[train.indices,]
> testing <- mtcars[-train.indices,]
> model <- lm(mpg ~ ., data=training)
> summary(model)
..... (output truncated)
Residual standard error: 1.188 on 5 degrees of freedom
Multiple R-squared:  0.988,    Adjusted R-squared:  0.9639
F-statistic: 41.06 on 10 and 5 DF,  p-value: 0.0003599

Before we go on, note that the model now explains a whopping 99% of the variance in mpg. Any R-squared this high should be a red flag; I've never seen a legitimate model with an R-squared this high on a non-contrived dataset. The increase in R-squared is attributable primarily to the decrease in observations (from 32 to 16) and the resultant increased opportunity to model spurious correlations.

Let's calculate the MSE of the model on the training dataset. To do this, we will be using the predict function without the newdata argument, which gives us the values the model would predict given the training data (these are referred to as the fitted values):

> mean((predict(model) - training$mpg)^2)
[1] 0.4408109
# Cool, but how does it perform on the validation set?
> mean((predict(model, newdata=testing) - testing$mpg)^2)
[1] 337.9995

My word!
In practice, the error on the training data is almost always a little less than the error on the testing data. However, a discrepancy in the MSE between the training and testing set as large as this is a clear-as-day indication that our model doesn't generalize.

Let's compare this model's validation set performance to a simpler model with a lower R-squared, which only uses am and wt as predictors:

> simpler.model <- lm(mpg ~ am + wt, data=training)
> mean((predict(simpler.model) - training$mpg)^2)
[1] 9.396091
> mean((predict(simpler.model, newdata=testing) - testing$mpg)^2)
[1] 12.70338

Notice that the MSE on the training data is much higher, but our validation set MSE is much lower.

If the goal is to blindly maximize the R-squared, the more predictors, the better. If the goal is a generalizable and useful predictive model, the goal should be to minimize the testing set MSE.

The validation set approach outlined in the previous paragraphs has two important drawbacks. For one, the model was only built using half of the available data. Secondly, we only tested the model's performance on one testing set; by some sleight of a magician's hand, our testing set could have contained some bizarre hard-to-predict examples that would make the validation set MSE too large.

Consider the following change to the approach: we divide the data up, just as before, into set a and set b. Then, we train the model on set a, test it on set b, then train it on b and test it on a. This approach has a clear advantage over our previous approach, because it averages the out-of-sample MSE of two testing sets. Additionally, the model will now be informed by all the data. This is called two-fold cross-validation, and the general technique is called k-fold cross-validation.

Note

The coefficients of the model will, of course, be different, but the actual data model (the variables to include and how to fit the line) will be the same.

To see how k-fold cross-validation works in a more general sense, consider the procedure to perform k-fold cross-validation where k=5. First, we divide the data into five equal groups (sets a, b, c, d, and e), and we train the model on the data from sets a, b, c, and d.
Then we record the MSE of the model against the unseen data in set e. We repeat this four more times—leaving out a different set and testing the model with it. Finally, the average of our five out-of-sample MSEs is our five-fold cross-validated MSE.

Your goal, now, should be to select a model that minimizes the k-fold cross-validation MSE. Common choices of k are 5 and 10.

To perform k-fold cross-validation, we will be using the cv.glm function from the boot package. This will also require us to build our models using the glm function (this stands for generalized linear models, which we'll learn about in the next chapter) instead of lm. For current purposes, it is a drop-in replacement:

> library(boot)
> bad.model <- glm(mpg ~ ., data=mtcars)
> better.model <- glm(mpg ~ am + wt + qsec, data=mtcars)
>
> bad.cv.err <- cv.glm(mtcars, bad.model, K=5)
> # the cross-validated MSE estimate we will be using
> # is a bias-corrected one stored as the second element
> # in the 'delta' vector of the cv.err object
> bad.cv.err$delta[2]
[1] 14.92426
>
> better.cv.err <- cv.glm(mtcars, better.model, K=5)
> better.cv.err$delta[2]
[1] 7.944148

The use of k-fold cross-validation over the simple validation set approach has illustrated that the kitchen-sink model is not as bad as we previously thought (because we trained it using more data), but it is still outperformed by the far simpler model that includes only am, wt, and qsec as predictors.

This out-performance by a simpler model is no idiosyncrasy of this dataset; it is a well-observed phenomenon in predictive analytics. Simpler models often outperform overly complicated models because of the resistance of a simpler model to overfitting. Further, simpler models are easier to interpret, to understand, and to use. The idea that, given the same level of predictive power, we should prefer simpler models to complicated ones is expressed in a famous principle called Occam's Razor.
Finally, we have enough background information to discuss the only piece of the lm summary output we haven't touched upon yet: adjusted R-squared. Adjusted R-squared attempts to take into account the fact that extraneous variables thrown into a linear model will always increase its R-squared. Adjusted R-squared, therefore, takes the number of predictors into account. As such, it penalizes complex models. Adjusted R-squared will always be equal to or lower than non-adjusted R-squared (it can even go negative!). The addition of each marginal predictor will only cause an increase in adjusted R-squared if it contributes significantly to the predictive power of the model, that is, more than would be dictated by chance. If it doesn't, the adjusted R-squared will decrease. Adjusted R-squared has some great properties, and as a result, many will try to select models that maximize the adjusted R-squared, but I prefer the minimization of cross-validated MSE as my main model selection criterion.

Compare for yourself the adjusted R-squared of the kitchen-sink model and a model using am, wt, and qsec.

Striking a balance

As Figure 8.10 depicts, as a model becomes more complicated/flexible—as it starts to include more and more predictors—the bias of the model continues to decrease. Along the complexity axis, as the model begins to fit the data better and better, the cross-validation error decreases as well. At a certain point, the model becomes overly complex, and begins to fit idiosyncratic noise in the training dataset—it overfits! The cross-validation error begins to climb again, even as the bias of the model approaches its theoretical minimum! The very left of the plot depicts models with too much bias, but little variance. The right side of the plot depicts models that have very low bias, but very high variance, and thus, are useless predictive models.

Figure 8.10: As model complexity/flexibility increases, training error (bias) tends to be reduced. Up to a certain point, the cross-validation error decreases as well. After that point, the cross-validation error starts to go up again, even as the model's bias continues to decrease. After this point, the model is too flexible and overfits.

The ideal point in this bias-variance trade-off is the point where the cross-validation error (not the training error) is minimized.

Okay, so how do we get there?
Although there are more advanced methods that we'll touch on in the section called Advanced topics, at this stage of the game, our primary recourse for finding our bias-variance trade-off sweet spot is careful feature selection.

In statistical learning parlance, feature selection refers to selecting which predictor variables to include in our model (for some reason, they call predictor variables features). I emphasized the word careful, because there are plenty of dangerous ways to do this. One such method—and perhaps the most intuitive—is to simply build models containing every possible subset of the available predictors, and choose the best one as measured by adjusted R-squared or the minimization of cross-validated error. Probably the biggest problem with this approach is that it's computationally very expensive—to build a model for every possible subset of predictors in mtcars, you would need to build (and cross-validate) 1,023 different models. The number of possible models rises exponentially with the number of predictors. Because of this, for many real-world modeling scenarios, this method is out of the question.

There is another approach that, for the most part, solves the problem of the computational intractability of the all-possible-subsets approach: stepwise regression.

Stepwise regression is a technique that programmatically tests different predictor combinations by adding predictors in (forward stepwise), or taking predictors out (backward stepwise), according to the value that each predictor adds to the model as measured by its influence on the adjusted R-squared. Therefore, like the all-possible-subsets approach, stepwise regression automates the process of feature selection.

Note

In case you care, the most popular implementation of this technique in R (the stepAIC function in the MASS package) doesn't maximize adjusted R-squared but, instead, minimizes a related model quality measure called the Akaike Information Criterion (AIC).

There are numerous problems with this approach. The least of these is that it is not guaranteed to find the best possible model.
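To make the stepAIC note concrete, here is a minimal sketch of my own (not from the text) of backward stepwise selection on the kitchen-sink mtcars model. It is shown for illustration only, given the criticisms that follow:

```r
# Illustration only: backward stepwise selection by AIC minimization
library(MASS)
kitchen.sink <- lm(mpg ~ ., data = mtcars)
stepped <- stepAIC(kitchen.sink, direction = "backward", trace = FALSE)
formula(stepped)   # the (much smaller) predictor subset chosen by AIC
```

Setting trace = FALSE suppresses the step-by-step log; with it on, you can watch each predictor being dropped.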
One of the primary issues that people cite is that it results in lazy science, by absolving us of the need to think out the problem, because we let an automated procedure make decisions for us. This school of thought usually holds that models should be informed, at least partially, by some amount of theory and domain expertise.

It is for these reasons that stepwise regression has fallen out of favor among many statisticians, and why I'm choosing not to recommend using it.

Stepwise regression is like alcohol: some people can use it without incident, but some can't use it safely. It is also like alcohol in that if you think you need to use it, you've got a big problem. Finally, neither can be advertised to children.

At this stage of the game, I suggest that your main approach to balancing bias and variance should be informed theory-driven feature selection, and paying close attention to k-fold cross-validation results. In cases where you have absolutely no theory, I suggest using regularization, a technique that is, unfortunately, beyond the scope of this text. The section Advanced topics briefly extols the virtues of regularization, if you want more information.

Linear regression diagnostics

I would be negligent if I failed to mention the boring but very critical topic of the assumptions of linear models, and how to detect violations of those assumptions. Just like the assumptions of the hypothesis tests in Chapter 6, Testing Hypotheses, linear regression has its own set of assumptions, the violation of which jeopardizes the accuracy of our model—and any inferences derived from it—to varying degrees. The checks and tests that ensure these assumptions are met are called diagnostics.

There are five major assumptions of linear regression:

That the errors (residuals) are normally distributed with a mean of 0
That the error terms are uncorrelated
That the errors have a constant variance
That the effect of the independent variables on the dependent variable is linear and additive
That multicollinearity is at a minimum

We'll briefly touch on these assumptions, and how to check for them, in this section.
To do this, we will be using a residual-fitted plot, since it allows us, with some skill, to verify most of these assumptions. To view a residual-fitted plot, just call the plot function on your linear model object:

> my.model <- lm(mpg ~ wt, data=mtcars)
> plot(my.model)

This will show you a series of four diagnostic plots—the residual-fitted plot is the first. You can also opt to view just the residual-fitted plot with this related incantation:

> plot(my.model, which=1)

We are also going back to Anscombe's Quartet, since the quartet's aberrant relationships collectively illustrate the problems that you might find with fitting regression models and assumption violation. To re-familiarize yourself with the quartet, look back to Figure 8.6.

Second Anscombe relationship

The first relationship in Anscombe's Quartet (y1 ~ x1) is the only one that can appropriately be modeled with linear regression as is. In contrast, the second relationship (y2 ~ x2) depicts a relationship that violates the requirement of a linear relationship. It also subtly violates the assumption of normally distributed residuals with a mean of zero. To see why, refer to Figure 8.11, which depicts its residual-fitted plot:

Figure 8.11: The top two panels show the first and second relationships of Anscombe's quartet, respectively. The bottom two panels depict each top panel's respective residual-fitted plot

A non-pathological residual-fitted plot will have data points randomly distributed along the invisible horizontal line where the y-axis equals 0. By default, this plot also contains a smooth curve that attempts to fit the residuals. In a non-pathological sample, this smooth curve should be approximately straight, and straddle the line at y = 0.

As you can see, the first Anscombe relationship does this well. In contrast, the smooth curve of the second relationship is a parabola. These residuals could have been drawn from a normal distribution with a mean of zero, but it is highly unlikely. Instead, it looks like these residuals were drawn from a distribution—perhaps from a normal distribution—whose mean changed as a function of the x-axis. Specifically, it appears as if the residuals at the two ends were drawn from a distribution whose mean was negative, and the middle residuals had a positive mean.
Third Anscombe relationship

We already dug deeper into this relationship when we spoke of robust regression earlier in the chapter. We saw that a robust fit of this relationship more or less ignored the clear outlier. Indeed, the robust fit is almost identical to the non-robust linear fit after the outlier is removed.

On occasion, a data point that is an outlier in the y-axis but not the x-axis (like this one) doesn't influence the regression line much—meaning that its omission wouldn't cause a substantial change in the estimated intercept and coefficients.

A data point that is an outlier in the x-axis (or axes) is said to have high leverage. Sometimes, points with high leverage don't influence the regression line much, either. However, data points that have high leverage and are outliers very often exert high influence on the regression fit, and must be handled appropriately.

Refer to the upper-right panel of Figure 8.12. The aberrant data point in the fourth relationship of Anscombe's quartet has very high leverage and high influence. Note that the slope of the regression line is completely determined by the y-position of that point.

Fourth Anscombe relationship

The following image depicts some of the linear regression diagnostic plots of the fourth Anscombe relationship:

Figure 8.12: The first and the fourth Anscombe relationships and their respective residual-fitted plots

Although it's difficult to say for sure, this is probably in violation of the assumption of constant variance of residuals (also called homogeneity of variance, or homoscedasticity if you're a fancy-pants).

A more illustrative example of the violation of homoscedasticity (or heteroscedasticity) is shown in Figure 8.13:

Figure 8.13: A paradigmatic depiction of the residual-fitted plot of a regression model for which the assumption of homogeneity of variance is violated

The preceding plot depicts the characteristic funnel shape symptomatic of residual-fitted plots of offending regression models. Notice how, on the left, the residuals vary very little, but the variances grow as you go along the x-axis.

Bear in mind that the residual-fitted plot need not resemble a funnel—any residual-fitted plot that very clearly shows the variance change as a function of the x-axis violates this assumption.
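You can generate a funnel-shaped residual-fitted plot like the one in Figure 8.13 for yourself. The following simulation is my own (not from the text); the trick is an error term whose standard deviation grows with x:

```r
# Simulated heteroscedasticity: error variance grows with x
set.seed(1)
x <- runif(200, min = 1, max = 10)
y <- 2 * x + rnorm(200, mean = 0, sd = x)   # sd of the noise increases with x
het.model <- lm(y ~ x)
plot(het.model, which = 1)                  # residual-fitted plot: a funnel
# the residuals really are more spread out at higher values of x
sd(resid(het.model)[x > 5.5])
sd(resid(het.model)[x <= 5.5])
```

Comparing the two standard deviations confirms numerically what the funnel shows visually.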
Looking back on Anscombe's Quartet, you may think that the three relationships' unsuitability for linear modeling was obvious, and you may not immediately see the benefit of diagnostic plots. But before you write off the art (not science) of linear regression diagnostics, consider that these were all relationships with a single predictor. In multiple regression, with tens of predictors (or more), it is very difficult to diagnose problems by just plotting different cuts of the data. It is in this domain where linear regression diagnostics really shine.

Finally, the last hazard to be mindful of when linearly regressing is the problem of collinearity, or multicollinearity. Collinearity occurs when two (or more) predictors are very highly correlated. This causes multiple problems for regression models, including highly uncertain and unstable coefficient estimates. An extreme example of this would be if we were trying to predict weight from height, and we had both height in feet and height in meters as predictors. In its most simple case, collinearity can be checked for by looking at the correlation matrix of all the regressors (using the cor function); any cell that has a high correlation coefficient implicates two predictors that are highly correlated and, therefore, hold redundant information in the model. In theory, one of these predictors should be removed.

A more sneaky issue presents itself when there are no two individual predictors that are highly correlated, but there are multiple predictors that are collectively correlated. This is multicollinearity. This would occur, to a small extent, if instead of predicting mpg from the other variables in the mtcars dataset, we were trying to predict a (non-existent) new variable using mpg and the other predictors. Since we know that mpg can be fairly reliably estimated from some of the other variables in mtcars, when it is a predictor in a regression modeling another variable, it would be difficult to tell whether the target's variance is truly explained by mpg, or whether it is explained by mpg's predictors.
The most common technique to detect multicollinearity is to calculate each predictor variable's Variance Inflation Factor (VIF). The VIF measures how much larger the variance of a coefficient is because of its collinearity. Mathematically, the VIF of a predictor, x_i, is:

VIF_i = 1 / (1 - R²_i)

where R²_i is the R-squared of a linear model predicting x_i from all other predictors (that is, every x except x_i). As such, the VIF has a lower bound of one (in the case that the predictor cannot be predicted accurately from the other predictors). Its upper bound is asymptotically infinite. In general, most view VIFs of more than four as cause for concern, and VIFs of 10 or above as indicative of a very high degree of multicollinearity. You can calculate VIFs for a model, post hoc, with the vif function from the car package:

> model <- lm(mpg ~ am + wt + qsec, data=mtcars)
> library(car)
> vif(model)
      am       wt     qsec
2.541437 2.482952 1.364339

Advanced topics

Linear models are the biggest idea in applied statistics and predictive analytics. There are massive volumes written about the smallest details of linear regression. As such, there are some important ideas that we can't go over here because of space concerns, or because they require knowledge beyond the scope of this book. So you don't feel like you're in the dark, though, here are some of the topics we didn't cover—and that I would have liked to—and why they are neat.

Regularization: Regularization was mentioned briefly in the subsection about balancing bias and variance. In this context, regularization is a technique wherein we penalize models for complexity, to varying degrees. My favorite method of regularizing linear models is by using elastic-net regression. It is a fantastic technique and, if you are interested in learning more about it, I suggest you install and read the vignette of the glmnet package:

> install.packages("glmnet")
> library(glmnet)
> vignette("glmnet_beta")

Non-linear modeling: Surprisingly, we can model highly non-linear relationships using linear regression. For example, let's say we wanted to build a model that predicts how many raisins to use for a cookie using the cookie's radius as a predictor. The relationship between predictor and target is no longer linear—it's quadratic.
However, if we create a new predictor that is the radius squared, the target will now have a linear relationship with the new predictor, and thus can be captured using linear regression. This basic premise can be extended to capture relationships that are cubic (power of 3), quartic (power of 4), and so on; this is called polynomial regression. Other forms of non-linear modeling don't use polynomial features, but instead directly fit non-linear functions to the predictors. Among these forms are regression splines and Generalized Additive Models (GAMs).

Interaction terms: Just like there are generalizations of linear regression that remove the requirement of linearity, so too are there generalizations of linear regression that eliminate the need for strictly additive and independent effects between predictors.

Take grapefruit juice, for example. Grapefruit juice is well known to block the intestinal enzyme CYP3A, and drastically affect how the body absorbs certain medicines. Let's pretend that grapefruit juice was mildly effective at treating existential dysphoria. And suppose there is a drug called Soma that was highly effective at treating this condition. When alleviation of symptoms is plotted as a function of dose, the grapefruit juice will have a very small slope, but the Soma will have a very large slope. Now, if we also pretend that grapefruit juice increases the efficiency of Soma absorption, then the relief of dysphoria of someone taking both grapefruit juice and Soma will be far higher than would be predicted by a multiple regression model that doesn't take into account the synergistic effects of Soma and the juice. The simplest way to model this interaction effect is to include the interaction term in the lm formula, like so:

> my.model <- lm(relief ~ soma*juice, data=my.data)

which builds a linear regression formula of the following form:

relief = β0 + β1*soma + β2*juice + β3*soma*juice

where, if β3 is larger than zero, then there is an interaction effect that is being modeled. On the other hand, if β3 is zero and β1 and β2 are positive, that suggests that the grapefruit juice completely blocks the effect of Soma (and vice versa).
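The Soma-and-juice synergy is easy to simulate and recover. Everything below is invented for illustration (the data, the coefficients, and the noise level are my own choices, not the book's):

```r
# Simulated interaction effect: the soma:juice coefficient recovers the synergy
set.seed(4)
soma  <- runif(100, min = 0, max = 10)
juice <- runif(100, min = 0, max = 10)
# "true" relationship: a synergistic (interaction) coefficient of 2
relief <- 1 + 5 * soma + 0.5 * juice + 2 * soma * juice + rnorm(100, sd = 5)
my.data <- data.frame(relief, soma, juice)
int.model <- lm(relief ~ soma * juice, data = my.data)
coef(int.model)   # the soma:juice estimate lands near the true value of 2
```

Note that soma * juice in the formula expands to soma + juice + soma:juice, so the main effects are included automatically.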
Bayesian linear regression: Bayesian linear regression is an alternative approach to the preceding methods that offers a lot of compelling benefits. One of the major benefits of Bayesian linear regression—which echoes the benefits of Bayesian methods as a whole—is that we obtain a posterior distribution of credible values for each of the beta coefficients. This makes it easy to make probabilistic statements about intervals in which the population coefficient is likely to lie. This makes hypothesis testing very easy.

Another major benefit is that we are no longer held hostage to the assumption that the residuals are normally distributed. If you were the good person you lay claim to being on your online dating profiles, you would have done the exercises at the end of the last chapter. If so, you would have seen how we could use the t-distribution to make our models more robust to the influence of outliers. In Bayesian linear regression, it is easy to use a t-distributed likelihood function to describe the distribution of the residuals. Lastly, by adjusting the priors on the beta coefficients and making them sharply peaked at zero, we achieve a certain amount of shrinkage regularization for free, and build models that are inherently resistant to overfitting.

Exercises

Practice the following exercises to revise the concepts learned thus far:

By far, the best way to become comfortable and learn the ins and outs of applied regression analysis is to actually carry out regression analyses. To this end, you can use some of the many datasets that are included in R. To get a full listing of the datasets in the datasets package, execute the following:

> help(package="datasets")

There are hundreds more datasets spread across the other several thousand R packages. Even better, load your own datasets, and attempt to model them.
Examine and plot the dataset pressure, which describes the relationship between the vapor pressure of mercury and temperature. What assumption of linear regression does this violate? Attempt to model this using linear regression by using temperature squared as a predictor, like this:

> lm(pressure ~ I(temperature^2), data=pressure)

Compare the fit between the model that uses the non-squared temperature and this one. Explore cubic and quartic relationships between temperature and pressure. How accurately can you predict pressure? Employ cross-validation to make sure that no overfitting has occurred. Marvel at how nicely physics plays with statistics sometimes, and wish that the behavioral sciences would behave better.

Keep an eye out for provocative news and human-interest stories or popular culture anecdotes that claim suspect causal relationships, like gum chewing causes heart disease or dark chocolate promotes weight loss. If these claims were backed up using data from natural experiments, try to think of potential confounding variables that invalidate the claim. Impress upon your friends and family that the media is trying to take advantage of their gullibility and non-fluency in the principles of statistics. As you become more adept at recognizing suspicious claims, you'll be invited to fewer and fewer parties. This will clear up your schedule for more studying.

To what extent can Mikhail Gorbachev's revisionism of late Stalinism be viewed as a precipitating factor in the fall of the Berlin Wall? Exceptional responses will address the effects of Western interpretations of Marx on the post-war Soviet intelligentsia.

Summary

Whew, we've been through a lot in this chapter, and I commend you for sticking it out. Your tenacity will be well rewarded when you start using regression analysis in your own projects or research like a professional.

We started off with the basics: how to describe a line, simple linear relationships, and how a best-fit regression line is determined. You saw how we can use R to easily plot these best-fit lines.
We went on to explore regression analysis with more than one predictor. You learned how to interpret the loquacious lm summary output, and what everything meant. In the context of multiple regression, you learned how the coefficients are properly interpreted as the effect of a predictor controlling for all other predictors. You're now aware that controlling for and thinking about confounds is one of the cornerstones of statistical thinking.

We discovered that we weren't limited to using continuous predictors, and that, using dummy coding, we can not only model the effects of categorical variables, but also replicate the functionality of the two-sample t-test and one-way ANOVA.

You learned of the hazards of going hog-wild and including all available predictors in a linear model. Specifically, you've come to find out that reckless pursuit of R-squared maximization is a losing strategy when it comes to building interpretable, generalizable, and useful models. You've learned that it is far better to minimize out-of-sample error using estimates from cross-validation. We framed this preference for test error minimization over training error minimization in terms of the bias-variance trade-off.

Penultimately, you learned the standard assumptions of linear regression and touched upon some ways to determine whether our assumptions hold. You came to understand that regression diagnostics isn't an exact science.

Lastly, you learned that there's much we haven't learned about regression analysis. This will keep us humble and hungry for more knowledge.

Chapter 9. Predicting Categorical Variables

Our first foray into predictive analytics began with regression techniques for predicting continuous variables. In this chapter, we will be discussing a perhaps even more popular class of techniques from statistical learning known as classification.

All these techniques have at least one thing in common: we train a learner on input for which the correct classifications are known, with the intention of using the trained model on new data whose classes are unknown. In this way, classification is a set of algorithms and methods to predict categorical variables.
Whether you know it or not, statistical learning algorithms performing classification are all around you. For example, if you've ever accidentally checked the Spam folder of your e-mail and been horrified, you can thank your lucky stars that there are sophisticated classification mechanisms that your e-mail is run through to automatically mark spam as such, so you don't have to see it. On the other hand, if you've ever had a legitimate e-mail sent to spam, or a spam e-mail sneak past the spam filter into your inbox, you've witnessed the limitations of classification algorithms firsthand: since the e-mails aren't being audited by a human one-by-one, and are being audited by a computer instead, misclassification happens. Just like our linear regression predictions differed from our training data to varying degrees, so too do classification algorithms make mistakes. Our job is to make sure we build models that minimize these misclassifications—a task which is not always easy.

There are many different classification methods available in R; we will be learning about four of the most popular ones in this chapter—starting with k-Nearest Neighbors.

k-Nearest Neighbors

You're at a train terminal looking for the right line to stand in to get on the train from Upstate NY to Penn Station in NYC. You've settled into what you think is the right line, but you're still not sure, because it's so crowded and chaotic. Not wanting to wait in the wrong line, you turn to the person closest to you and ask them where they're going: "Penn Station," says the stranger, blithely.

You decide to get some second opinions. You turn to the second closest person and the third closest person and ask them separately: Penn Station and Nova Scotia, respectively. The general consensus seems to be that you're in the right line, and that's good enough for you.

If you've understood the preceding interaction, you already understand the idea behind k-Nearest Neighbors (k-NN hereafter) on a fundamental level. In particular, you've just performed k-NN, where k=3. Had you just stopped at the first person, you would have performed k-NN, where k=1.
So, k-NN is a classification technique that, for each data point we want to classify, finds the k closest training data points and returns the consensus. In traditional settings, the most common distance metric is Euclidean distance (which, in two dimensions, is equal to the distance from point a to point b given by the Pythagorean Theorem). Another common distance metric is Manhattan distance, which, in two dimensions, is equal to the sum of the lengths of the legs of the triangle connecting two data points.

Figure 9.1: Two points on a Cartesian plane. Their Euclidean distance is 5. Their Manhattan distance is 3 + 4 = 7

k-Nearest Neighbors is a bit of an oddball technique; most statistical learning methods attempt to impose a particular model on the data and estimate the parameters of that model. Put another way, the goal of most learning methods is to learn an objective function that maps inputs to outputs. Once the objective function is learned, there is no longer a need for the training set.

In contrast, k-NN learns no such objective function. Rather, it lets the data speak for themselves. Since there is no actual learning, per se, going on, k-NN needs to hold on to the training dataset for future classifications. This also means that the training step is instantaneous, since there is no training to be done. Most of the time spent during the classification of a data point is spent finding its nearest neighbors. This property of k-NN makes it a lazy learning algorithm.

Since no particular model is imposed on the training data, k-NN is one of the most flexible and accurate classification learners there are, and it is very widely used. With great flexibility, though, comes great responsibility—it is our responsibility to ensure that k-NN hasn't overfit the training data.
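Both metrics in Figure 9.1 are one-liners in R. The coordinates below are my own choice, picked so that the distances match the figure's 5 and 7:

```r
# Euclidean and Manhattan distance between the points (0, 0) and (3, 4)
a <- c(0, 0)
b <- c(3, 4)
sqrt(sum((a - b)^2))   # Euclidean distance: 5
sum(abs(a - b))        # Manhattan distance: 3 + 4 = 7
# base R's dist() supports both metrics as well
dist(rbind(a, b), method = "euclidean")
dist(rbind(a, b), method = "manhattan")
```

The dist function generalizes this to whole matrices of observations, which is essentially what a k-NN implementation does under the hood.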
Figure 9.2: The species classification regions of the iris dataset using 1-NN

In Figure 9.2, we use the built-in iris dataset. This dataset contains four continuous measurements of iris flowers and maps each observation to one of three species: iris setosa (the square points), iris virginica (the circular points), and iris versicolor (the triangular points). In this example, we use only two of the available four attributes in our classification, for ease of visualization: sepal width and petal width. As you can see, each species seems to occupy its own little space in our 2-D feature space. However, there seems to be a little overlap between the versicolor and virginica data points. Because this classifier is using only one nearest neighbor, there appear to be small regions of training-data-specific idiosyncratic classification behavior where virginica is encroaching on the versicolor classification region. This is what it looks like when our k-NN overfits the data. In our train station metaphor, this is tantamount to asking only one neighbor what line you're on, and the misinformed (or malevolent) neighbor telling you the wrong answer.

k-NN classifiers that have overfit have traded low variance for low bias. It is common for overfit k-NN classifiers to have a 0% misclassification rate on the training data, but small changes in the training data harshly change the classification regions (high variance). Like with regression (and the rest of the classifiers we'll be learning about in this chapter), we aim to find the optimal point in the bias-variance trade-off—the one that minimizes error in an independent testing set, and not one that minimizes training set misclassification error. We do this by modifying the k in k-NN and using the consensus of more neighbors. Beware—if you ask too many neighbors, you start to take the answers of rather distant neighbors seriously, and this can also adversely affect accuracy. Finding the "sweet spot", where k is neither too small nor too large, is called hyperparameter optimization (because k is called a hyperparameter of k-NN).
Figure 9.3: The species classification regions of the iris dataset using 15-NN. The boundaries between the classification regions are now smoother and less overfit

Compare Figure 9.2 to Figure 9.3, which depicts the classification regions of the iris classification task using 15 nearest neighbors. The aberrant virginicas are no longer carving out their own territory in versicolor's region, and the boundaries between the classification regions (also called decision boundaries) are now smoother, often a trait of classifiers that have found the sweet spot in the bias-variance trade-off. One could imagine that new training data will no longer have such a drastic effect on the decision boundaries, at least not as much as with the 1-NN classifier.

Note

In the iris flower example, and the next example, we deal with continuous predictors only. k-NN can handle categorical variables, though, not unlike how we dummy-coded categorical variables in linear regression in the last chapter! Though we didn't talk about how, regression (and k-NN) handles non-binary categorical variables, too. Can you think of how this is done? Hint: we can't use just one dummy variable for a non-binary categorical variable, and the number of dummy variables needed is one less than the number of categories.

Using k-NN in R

The dataset we will be using for all the examples in this chapter is the PimaIndiansDiabetes dataset from the mlbench package. This dataset is part of the data collected from one of the numerous diabetes studies on the Pima Indians, a group of indigenous Americans who have among the highest prevalence of Type II diabetes in the world, probably due to a combination of genetic factors and their relatively recent introduction to a heavily processed Western diet. For 768 observations, it has nine attributes, including skinfold thickness, BMI, and so on, and a binary variable representing whether the patient had diabetes. We will be using the eight predictor variables to train a classifier to predict whether a patient has diabetes or not.
This dataset was chosen because it has many observations available, has a goodly amount of predictor variables available, and it is an interesting problem. Additionally, it is not unlike many other medical datasets that have a few predictors and a binary class outcome (for example, alive/dead, pregnant/not-pregnant, benign/malignant). Finally, unlike many classification datasets, this one has a good mixture of both class outcomes; it contains 35% diabetes-positive observations. Grievously imbalanced datasets can cause a problem with some classifiers and impair our accuracy estimates.

To get this dataset, we are going to run the following commands to install the necessary packages, load the data, and give the dataset a new name that is faster to type:

> # "class" is one of the packages that implement k-NN
> # "chemometrics" contains a function we need
> # "mlbench" holds the dataset
> install.packages(c("class", "mlbench", "chemometrics"))
> library(class)
> library(mlbench)
> data(PimaIndiansDiabetes)
> PID <- PimaIndiansDiabetes

Now, let's divide our dataset into a training set and a testing set using an 80/20 split:

> # we set the seed so that our splits are the same
> set.seed(3)
> ntrain <- round(nrow(PID)*4/5)
> train <- sample(1:nrow(PID), ntrain)
> training <- PID[train, ]
> testing <- PID[-train, ]

Now we have to choose how many nearest neighbors we want to use. Luckily, there's a great function called knnEval from the chemometrics package that will allow us to graphically visualize the effectiveness of k-NN with different values of k using cross-validation. Our objective measure of effectiveness will be the misclassification rate, or the percent of testing observations that are misclassified.
> resknn <- knnEval(scale(PID[,-9]), PID[,9], train, kfold=10,
+                   knnvec=seq(1, 50, by=1),
+                   legpos="bottomright")

There's a lot here to explain! The first three arguments are the predictor matrix, the variable to predict, and the indices of the training dataset, respectively. Note that the ninth column of the PID data frame holds the class labels; to get a matrix containing just the predictors, we can remove the ninth column by using a negative column index. The scale function that we call on the predictor matrix subtracts each value in each column by the column's mean and divides each value by its respective column's standard deviation; it converts each value to a z-score! This is usually important in k-NN in order for the distances between data points to be meaningful. For example, the distance between data points would change drastically if a column previously measured in meters were re-represented as millimeters. The scale function puts all the features in comparable ranges regardless of the original units.

Note that for the third argument, we are not supplying the function with the training dataset, but the indices that we used to construct the training dataset. If you are confused, inspect the various objects we have in our workspace with the head function.

The final three arguments indicate that we want to use 10-fold cross-validation, check every value of k from 1 to 50, and put the legend in the lower-right corner of the plot.

The plot that this code produces is shown in Figure 9.4:

Figure 9.4: A plot illustrating test set error, cross-validated error, and training set error as a function of k in k-NN. After about k=15, the test and CV error doesn't appear to change much

As you can see from the preceding plot, after about k=15, the test and cross-validated misclassification error don't seem to change much. Using k=27 seems like a safe bet, as measured by the minimization of CV error.
Note

To see what it looks like when we underfit and use too many neighbors, check out Figure 9.5, which expands the x-axis of the last figure to show the misclassification error of using up to 200 neighbors. Notice that the test and CV error start off high (at 1-NN) and quickly decrease. At about 70-NN, though, the test and CV error start to rise steadily as the classifier underfits. Note also that the training error starts out at 0 for 1-NN (as we would expect), but very quickly increases as we add more neighbors. This is a good reminder that our goal is not to minimize the training set error but to minimize error on an independent dataset, either a test set or an estimate using cross-validation.

Figure 9.5: A plot illustrating test set error, cross-validated error, and training set error as a function of k in k-NN up to k=200. Notice how error increases as the number of neighbors becomes too large and causes the classifier to underfit

Let's perform the k-NN!

> predictions <- knn(scale(training[,-9]),
+                    scale(testing[,-9]),
+                    training[,9], k=27)
>
> # function to give correct classification rate
> accuracy <- function(predictions, answers){
+   sum(predictions == answers) / length(answers)
+ }
>
> accuracy(predictions, testing[,9])
[1] 0.7597403

It looks like using 27-NN gave us a correct classification rate of 76% (a misclassification rate of 100% - 76% = 24%). Is that good? Well, let's put it in perspective.

If we randomly guessed whether each testing observation was positive for diabetes, we would expect a classification rate of 50%. But remember that the number of non-diabetes observations outnumbers the number of observations of diabetes (non-diabetes observations are 65% of the total). So, if we built a classifier that just predicted no diabetes for every observation, we would expect a 65% correct classification rate. Luckily, our classifier performs significantly better than our naïve classifier, although, perhaps, not as well as we would have hoped. As we'll learn as the chapter moves on, k-NN is competitive with the accuracy of other classifiers; I guess it's just a really hard problem!
Confusion matrices

We can get a more detailed look at our classifier's accuracy via a confusion matrix. You can get R to give up a confusion matrix with the following command:

> table(testing[,9], predictions)
     predictions
      neg pos
  neg  86   9
  pos  28  31

The columns in this matrix represent our classifier's predictions; the rows represent the true classifications of our testing set observations. If you recall from Chapter 3, Describing Relationships, this means that the confusion matrix is a cross-tabulation (or contingency table) of our predictions and the actual classifications. The cell in the top-left corner represents observations that didn't have diabetes that we correctly predicted as non-diabetic (true negatives). In contrast, the cell in the lower-right corner represents true positives. The upper-right cell contains the count of false positives, observations that we incorrectly predicted as having diabetes. Finally, the remaining cell holds the number of false negatives, of which there are 28.

This is helpful for examining whether there is a class that we are systematically misclassifying or whether our false negatives and false positives are significantly imbalanced. Additionally, there are often different costs associated with false negatives and false positives. For example, in this case, the cost of misclassifying a patient as non-diabetic is great, because it impedes our ability to help a truly diabetic patient. In contrast, misclassifying a non-diabetic patient as diabetic, although not ideal, incurs a far less grievous cost. A confusion matrix lets us view, at a glance, just what types of errors we are making. For k-NN, and the other classifiers in this chapter, there are ways to specify the cost of each type of misclassification in order to produce a classifier optimized for a particular cost-sensitive domain, but that is beyond the scope of this book.

Limitations of k-NN

Before we move on, we should talk about some of the limitations of k-NN.

First, if you're not careful to use an optimized implementation of k-NN, classification can be slow, since it requires the calculation of the test data point's distance to every other data point; sophisticated implementations have mechanisms for partially handling this.
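As a sketch using the cell counts from the confusion matrix above, the common error-type summaries can be computed directly in base R (the names sensitivity and specificity are standard terminology, though this chapter doesn't use them):

```r
# cell counts from the confusion matrix above
tn <- 86; fp <- 9    # true negatives, false positives
fn <- 28; tp <- 31   # false negatives, true positives

accuracy    <- (tn + tp) / (tn + fp + fn + tp)  # overall correct rate
sensitivity <- tp / (tp + fn)  # proportion of true diabetics we caught
specificity <- tn / (tn + fp)  # proportion of non-diabetics correctly cleared

round(c(accuracy, sensitivity, specificity), 3)
```

Note that the sensitivity is far lower than the specificity, which quantifies the concern above: most of our costly errors here are false negatives.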
Second, vanilla k-NN can perform poorly when the number of predictor variables becomes too large. In the iris example, we used only two predictors, which can be plotted in two-dimensional space where the Euclidean distance is just the 2-D Pythagorean theorem that we learned in middle school. A classification problem with n predictors is represented in n-dimensional space; the Euclidean distance between two points in high-dimensional space can be very large, even if the data points are similar. This, and other complications that arise from predictive analytics techniques using a high-dimensional feature space, is, colloquially, known as the curse of dimensionality. It is not uncommon for medical, image, or video data to have hundreds or even thousands of dimensions. Luckily, there are ways of dealing with these situations. But let's not dwell there.

Logistic regression

Remember when I said, a thorough understanding of linear models will pay enormous dividends throughout your career as an analyst, in the previous chapter? Well, I wasn't lying! This next classifier is a product of a generalization of linear regression that can act as a classifier.

What if we used linear regression on a binary outcome variable, representing diabetes as 1 and not diabetes as 0? We know that the output of linear regression is a continuous prediction, but what if, instead of predicting the binary class (diabetes or not diabetes), we attempted to predict the probability of an observation having diabetes? So far, the idea is to train a linear regression on a training set where the variable we are trying to predict is a dummy-coded 0 or 1, and the predictions on an independent test set are interpreted as a continuous probability of class membership.
It turns out this idea is not quite as crazy as it sounds; the outcomes of the predictions are indeed proportional to the probability of each observation's class membership. The biggest problem is that the outcome is only proportional to the class membership probability and can't be directly interpreted as a true probability. The reason is simple: probability is, indeed, a continuous measurement, but it is also a constrained measurement; it is bounded by 0 and 1. With regular old linear regression, we will often get predicted outcomes below 0 and above 1, and it is unclear how to interpret those outcomes.

But what if we had a way of taking the outcome of a linear regression (a linear combination of beta coefficients and predictors) and applying a function to it that constrains it to be between 0 and 1 so that it can be interpreted as a proper probability? Luckily, we can do this with the logistic function:

  y = 1 / (1 + e^(-x))

whose plot is depicted in Figure 9.6:

Figure 9.6: The logistic function

Note that no matter what value of x (the output of the linear regression) we use, from negative infinity to positive infinity, the y (the output of the logistic function) is always between 0 and 1. Now we can adapt linear regression to output probabilities!

The function that we apply to the linear combination of predictors to change it into the kind of prediction we want is called the inverse link function. The function that transforms the dependent variable into a value that can be modeled using linear regression is just called the link function. In logistic regression, the link function (which is the inverse of the inverse link function, the logistic function) is called the logit function.

Before we get started using this powerful idea on our data, there are two other problems that we must contend with. The first is that we can't use ordinary least squares to solve for the coefficients anymore, because the link function is non-linear. Most statistical software solves this problem using a technique called Maximum Likelihood Estimation (MLE) instead, though there are other alternatives.
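As a quick sketch in R, you can verify the boundedness of the logistic function, and its relationship to the logit link function, yourself:

```r
# the logistic (inverse link) function
logistic <- function(x) 1 / (1 + exp(-x))

logistic(0)      # 0.5: a linear-regression output of 0 maps to a coin flip
logistic(-100)   # very close to 0
logistic(100)    # very close to 1

# its inverse, the logit (log-odds) link function
logit <- function(p) log(p / (1 - p))
logit(logistic(2))   # recovers 2
```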
The second problem is that an assumption of linear regression (if you remember from the last chapter) is that the error distribution is normally distributed. In the context of a binary dependent variable, this doesn't make sense, because it is a binary categorical variable. So, logistic regression models the error distribution as a Bernoulli distribution (or a binomial distribution, depending on how you look at it).

Note

Generalized Linear Model (GLM)

If you are surprised that linear regression can be generalized enough to accommodate classification, prepare to be astonished by generalized linear models!

GLMs are a generalization of regular linear regression that allow for other link functions to map from linear model output to the dependent variable, and other error distributions to describe the residuals. In logistic regression, the link function and error distribution are the logit and binomial, respectively. In regular linear regression, the link function is the identity function (a function that returns its argument unchanged), and the error distribution is the normal distribution.

Besides regular linear regression and logistic regression, there are still other species of GLM that use other link functions and error distributions. Another common GLM is Poisson regression, a technique that is used to predict/model count data (number of traffic stops, number of red cards, and so on), which uses the logarithm as the link function and the Poisson distribution as its error distribution. The use of the log link function constrains the response variable (the dependent variable) so that it is always above 0.

Remember that we expressed the t-test and ANOVA in terms of the linear model? So the GLM encompasses not only linear regression, logistic regression, Poisson regression, and the like, but it also encompasses t-tests, ANOVA, and the related technique called ANCOVA (Analysis of Covariance). Pretty cool, eh?!

Using logistic regression in R

Performing logistic regression, an advanced and widely used classification method, could scarcely be easier in R. To fit a logistic regression, we use the familiar glm function.
The difference now is that we'll be specifying our own error distribution and link function (the glm calls of the last chapter assumed we wanted the regular linear regression error distribution and link function, by default). These are specified in the family argument:

> model <- glm(diabetes ~ ., data=PID, family=binomial(logit))

Here, we build a logistic regression using all available predictor variables.

You may also see logistic regressions being performed where the family argument looks like family="binomial" or family=binomial(); it's all the same thing, I just like being more explicit.

Let's look at the output from calling summary on the model:

> summary(model)

Call:
glm(formula = diabetes ~ ., family = binomial(logit), data = PID)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.5566  -0.7274  -0.4159   0.7267   2.9297

Coefficients:
              Estimate  Std. Error z value Pr(>|z|)
(Intercept) -8.4046964   0.7166359 -11.728  < 2e-16 ***
pregnant     0.1231823   0.0320776   3.840 0.000123 ***
glucose      0.0351637   0.0037087   9.481  < 2e-16 ***
pressure    -0.0132955   0.0052336  -2.540 0.011072 *
...

The output is similar to that of regular linear regression; for example, we still get estimates of the coefficients and associated p-values. The interpretation of the beta coefficients requires a little more care this time around, though. The beta coefficient of pregnant, 0.123, means that a one unit increase in pregnant (an increase in the number of times being pregnant by one) is associated with an increase of 0.123 in the logarithm of the odds of the observation being diabetic. If this is confusing, concentrate on the fact that if the coefficient is positive, it has a positive impact on the probability of the dependent variable, and if the coefficient is negative, it has a negative impact on the probability of the binary outcome. Whether positive means a higher probability of diabetes or a higher probability of not diabetes depends on how your binary dependent variable is dummy-coded.
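If the log-odds scale feels unintuitive, a common trick (a sketch of standard practice, not something this chapter relies on) is to exponentiate a coefficient to get an odds ratio:

```r
# the estimated coefficient for "pregnant" from the summary output above
b_pregnant <- 0.1231823

# exponentiating a log-odds coefficient yields an odds ratio
odds_ratio <- exp(b_pregnant)
round(odds_ratio, 3)
# roughly 1.131: each additional pregnancy multiplies the odds
# of diabetes by about 1.13 (holding the other predictors constant)
```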
To find the training set accuracy of our model, we can use the accuracy function we wrote in the last section. In order to use it correctly, though, we need to convert the probabilities into class labels, as follows:

> predictions <- round(predict(model, type="response"))
> predictions <- ifelse(predictions == 1, "pos", "neg")
> accuracy(predictions, PID$diabetes)
[1] 0.7825521

Cool, we get a 78% accuracy on the training data, but remember: if we overfit, our training set accuracy will not be a reliable estimate of performance on an independent dataset. In order to test this model's generalizability, let's perform k-fold cross-validation, just like in the previous chapter!

> set.seed(3)
> library(boot)
> cv.err <- cv.glm(PID, model, K=5)
> cv.err$delta[2]
[1] 0.154716
> 1 - cv.err$delta[2]
[1] 0.845284

Wow, our CV-estimated accuracy rate is 85%! This indicates that it is highly unlikely that we are overfitting. If you are wondering why we were using all available predictors after I said that doing so was dangerous business in the last chapter, it's because though they do make the model more complex, the extra predictors didn't cause the model to overfit.

Finally, let's test the model on the independent test set so that we can compare this model's accuracy against k-NN's:

> predictions <- round(predict(model, type="response",
+                              newdata=testing))
> predictions <- ifelse(predictions == 1, "pos", "neg")
> accuracy(predictions, testing[,9])   # 78%
[1] 0.7792208

Nice! A 78% accuracy rate!

It looks like logistic regression may have given us a slight improvement over the more flexible k-NN. Additionally, the model gives us at least a little transparency into why each observation is classified the way it is, a luxury not available to us via k-NN.

Before we move on, it's important to discuss two limitations of logistic regression. The first is that logistic regression proper does not handle non-binary categorical variables: variables with more than two levels. There exists a generalization of logistic regression, called multinomial regression, that can handle this situation, but it is vastly less common than logistic regression. It is, therefore, more common to see another classifier being used for a non-binary classification problem.
The last limitation of logistic regression is that it results in a linear decision boundary. This means that if a binary outcome is not easily separated by a line, plane, or hyperplane, then logistic regression may not be the best route. May in the previous sentence is italicized because there are tricks you can use to get logistic regression to spit out a non-linear decision boundary (sometimes, a high-performing one), as we'll see in the section titled Choosing a classifier.

Decision trees

We now move on to one of the most easily interpretable and most popular classifiers there are out there: the decision tree. Decision trees, which look like an upside-down tree with the trunk on top and the leaves on the bottom, play an important role in situations where classification decisions have to be transparent and easily understood and explained. They also handle both continuous and categorical predictors, outliers, and irrelevant predictors rather gracefully. Finally, the general ideas behind the algorithms that create decision trees are quite intuitive, though the details can sometimes get hairy.

Figure 9.7 depicts a simple decision tree designed to classify motor vehicles into either motorcycles, golf carts, or sedans.

Figure 9.7: A simple and illustrative decision tree that classifies motor vehicles into motorcycles, golf carts, or sedans

This is a rather simple decision tree with only three leaves (terminal nodes) and two decision points. Note that the first decision point is (a) on a binary categorical variable, and (b) results in one terminal node, motorcycle. The other branch contains the other decision point, a continuous variable with a split point. This split point was chosen carefully by the decision-tree-creating algorithm to result in the most informative split: the one that best classifies the rest of the observations as measured by the misclassification rate of the training data.
Note

Actually, in most cases, the decision-tree-creating algorithm doesn't choose a split that results in the lowest misclassification rate of the training data, but chooses the one that minimizes either the Gini coefficient or the cross entropy of the remaining training observations. The reasons for this are two-fold: (a) both the Gini coefficient and cross entropy have mathematical properties that make them more easily amenable to numerical optimization, and (b) it generally results in a final tree with less bias.

The overall idea of the decision-tree-growing algorithm, recursive splitting, is simple:

1. Step 1: Choose a variable and split point that results in the best classification outcomes.
2. Step 2: For each of the resulting branches, check to see if some stopping criterion is met. If so, leave it alone. If not, move on to the next step.
3. Step 3: Repeat Step 1 on the branches that do not meet the stopping criterion.

The stopping criterion is usually either a certain depth, which the tree cannot grow past, or a minimum number of observations, beyond which a leaf node cannot further classify. Both of these are hyper-parameters (also called tuning parameters) of the decision tree algorithm, just like the k in k-NN, and must be fiddled with in order to achieve the best possible decision tree for classifying an independent dataset.

A decision tree, if not kept in check, can grossly overfit the data, returning an enormous and complicated tree with a minimum leaf node size of 1, resulting in a nearly bias-less classification mechanism with prodigious variance. To prevent this, either the tuning parameters must be chosen carefully or a huge tree can be built and cut down to size afterward. The latter technique is generally preferred and is, quite appropriately, called pruning. The most common pruning technique is called cost complexity pruning, where complex parts of the tree that provide little in the way of classification power, as measured by improvement of the final misclassification rate, are cut down and removed.
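As a sketch of the node-impurity measures mentioned in the note above (these two helper functions are my own illustration, not part of any tree-building package), both the Gini coefficient and cross entropy are largest for a 50/50 node and zero for a perfectly pure one:

```r
# p is a vector of class proportions within a node (summing to 1)
gini_impurity <- function(p) sum(p * (1 - p))
cross_entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))

gini_impurity(c(0.5, 0.5))   # 0.5, the worst case for two classes
gini_impurity(c(1, 0))       # 0, a perfectly pure node

cross_entropy(c(0.5, 0.5))   # log(2), about 0.693
cross_entropy(c(1, 0))       # 0
```

Both functions are smooth in the class proportions, which is part of what makes them easier to optimize numerically than the (stepwise) misclassification rate.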
Enough theory, let's get started! First, we'll grow a full tree using the PID dataset and plot the result:

> library(tree)
> our.big.tree <- tree(diabetes ~ ., data=training)
> summary(our.big.tree)

Classification tree:
tree(formula = diabetes ~ ., data = training)
Variables actually used in tree construction:
[1] "glucose"  "age"      "mass"     "pedigree" "triceps"  "pregnant"
[7] "insulin"
Number of terminal nodes: 16
Residual mean deviance: 0.7488 = 447.8 / 598
Misclassification error rate: 0.184 = 113 / 614

> plot(our.big.tree)
> text(our.big.tree)

The resulting plot is depicted in Figure 9.8.

Figure 9.8: An unpruned and complex decision tree

The power of a decision tree (which is usually not competitive with other classification mechanisms, accuracy-wise) is that the representation of the decision rules is transparent, easy to visualize, and easy to explain. This tree is rather large and unwieldy, which hinders its ability to be understood (or memorized) at a glance. Additionally, for all its complexity, it only achieves an 81% accuracy rate on the training data (as reported by the summary function).

We can (and will) do better! Next, we will be investigating the optimal size of the tree by employing cross-validation, using the cv.tree function:

> set.seed(3)
> cv.results <- cv.tree(our.big.tree, FUN=prune.misclass)
> plot(cv.results$size, cv.results$dev, type="b")

In the preceding code, we are telling the cv.tree function that we want to prune our tree using the misclassification rate as our objective metric. Then, we are plotting the CV error rate (dev) as a function of tree size (size).

Figure 9.9: A plot of cross-validated misclassification error as a function of tree size. Observe that a tree of size one performs terribly, and that the error rate steeply declines before rising slightly as the tree is overfit at large sizes

As you can see from the output (shown in Figure 9.9), the optimal size (number of terminal nodes) of the tree seems to be five. However, a tree of size three is not terribly less performant than a tree of size five; so, for ease of visualization, interpretation, and memorization, we will be using a final tree with three terminal nodes. To actually perform the pruning, we will be using the prune.misclass function, which takes the size of the tree as an argument.
> pruned.tree <- prune.misclass(our.big.tree, best=3)
> plot(pruned.tree)
> text(pruned.tree)
> # let's test its accuracy
> pruned.preds <- predict(pruned.tree, newdata=testing, type="class")
> accuracy(pruned.preds, testing[,9])   # 71%
[1] 0.7077922

The final tree is depicted in Figure 9.10.

Figure 9.10: A simpler decision tree with the same testing set performance as the tree in Figure 9.8

Rad! A tree so simple it can be easily memorized by medical personnel, and it achieves the same testing-set accuracy as the unwieldy tree in Figure 9.8: 71%! Now, the accuracy rate, by itself, is nothing to write home about, particularly because the naïve classifier achieves a 65% accuracy rate. Nevertheless, the fact that a significantly better classifier can be built from two simple rules (closely following the logic physicians employ, anyway) is where decision trees have a huge leg up relative to other techniques. Further, we could have bumped up this accuracy rate with more samples and more careful hyper-parameter tuning.

Random forests

The final classifier that we will be discussing in this chapter is the aptly named Random Forest, and it is an example of a meta-technique called ensemble learning. The idea and logic behind random forests follows thusly:

Given that (unpruned) decision trees can be nearly bias-less, high-variance classifiers, a method of reducing variance at the cost of a marginal increase of bias could greatly improve upon the predictive accuracy of the technique. One salient approach to reducing the variance of decision trees is to train a bunch of unpruned decision trees on different random subsets of the training data, sampling with replacement; this is called bootstrap aggregating, or bagging. At the classification phase, the test observation is run through all of these trees (a forest, perhaps?), and each resulting classification casts a vote for the final classification of the whole forest. The class with the highest number of votes is the winner. It turns out that the consensus among many high-variance trees on bootstrapped subsets of the training data results in a significant accuracy improvement and vastly decreased variance.

Note

Très bien, ensemble!
Bagging is one example of an ensemble method: a meta-technique that uses multiple classifiers to improve predictive accuracy. Nearly bias-less/high-variance classifiers are the ones that seem to benefit the most from ensemble methods. Additionally, ensemble methods are easiest to use with classifiers that are created and trained rapidly, since the method ipso facto relies on a large number of them. Decision trees fit all of these characteristics, and this accounts for why bagged trees and random forests are the most common ensemble learning instruments.

So far, what we have chronicled describes a technique called bagged trees. But random forests have one more trick up their sleeves! Observing that the variance can be further reduced by forcing the trees to be less similar, random forests differ from bagged trees by forcing each tree to use only a subset of its available predictors to split on in the growing phase.

Many people are initially confused as to how deliberately reducing the efficacy of the component trees can possibly result in a more accurate ensemble. To clear this up, consider that a few very influential predictors will dominate the expression of the trees, even if the subsets contain little overlap. By constraining the number of predictors a tree can use at each splitting phase, a more diverse crop of trees is built. This results in a forest with lower variance than a forest with no constraints.

Random forests are the modern darling of classifiers, and for good reason. For one, they are often extraordinarily accurate. Second, since random forests use only two hyper-parameters (the number of trees to use in the forest and the number of predictors to use at each step of the splitting process), they are very easy to create, and require little in the way of hyper-parameter tuning. Third, it is extremely difficult for a random forest to overfit, and it doesn't happen very often at all, in practice. For example, increasing the number of trees that make up the forest does not cause the forest to overfit, and fiddling with the number-of-predictors hyper-parameter can't possibly result in a forest with a higher variance than that of the component tree that overfits the most.
One last awesome property of the random forest is that the training error rate that it reports is a nearly unbiased estimator of the cross-validated error rate. This is because the training error rate, at least the one that R reports when using the predict function on a randomForest with no newdata argument, is the average error rate of the classifier tested on all the observations that were kept out of the training sample at each stage of the bootstrap aggregation. Because these were independent observations, and not used for training, it closely approximates the CV error rate. The error rate reported on the observations left out of the sample at every bagging step is called the Out-Of-Bag (OOB) error rate.

The primary drawback to random forests is that they, to some extent, revoke the chief benefit of decision trees: their interpretability; it is far harder to visualize the behavior of a random forest than it is for any of the component trees. This puts the interpretability of random forests somewhere between logistic regression (which is marginally more interpretable) and k-NN (which is largely un-interpretable).

At long last, let's use a random forest on our dataset to classify observations as being positive or negative for diabetes!

> library(randomForest)
> forest <- randomForest(diabetes ~ ., data=training,
+                        importance=TRUE,
+                        ntree=2000,
+                        mtry=5)
> accuracy(predict(forest), training[,9])
[1] 0.7654723
> predictions <- predict(forest, newdata=testing)
> accuracy(predictions, testing[,9])
[1] 0.7727273

In this incantation, we set the number of trees (ntree) to an arbitrarily high number and set the number of predictors (mtry) to 5. Though it is not shown above, I used the OOB error rate to guide the choosing of this hyper-parameter. Had we left it blank, it would have defaulted to the square root of the number of total predictors.

As you can see from the output of our accuracy function, the random forest is competitive with the performance of our highest performing (on this dataset, at least) classifier: logistic regression. On other datasets, with other characteristics, random forests sometimes blow the competition out of the water.
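To get a feel for where the OOB observations come from, here is a small base-R simulation (my own sketch, independent of the randomForest package) showing that each bootstrap sample leaves out roughly a third of the observations, which is what makes the OOB error estimate possible:

```r
set.seed(3)
n <- 10000                              # a pretend training-set size
boot <- sample(1:n, n, replace = TRUE)  # one bootstrap sample (bagging step)

# fraction of observations never drawn: these are "out of bag"
oob_fraction <- mean(!(1:n %in% boot))
round(oob_fraction, 2)
# close to 1/e (about 0.37), since P(never drawn) = (1 - 1/n)^n
```

So each tree in the forest has roughly 37% of the training observations available to it as an honest, held-out test set, and averaging those per-tree errors gives the OOB error rate.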
Choosing a classifier

These are just four of the most popular classifiers out there, but there are many more to choose from. Although some classification mechanisms perform better on some types of datasets than others, it can be hard to develop an intuition for exactly which ones they are suitable for. In order to help with this, we will be examining the efficacy of our four classifiers on four different two-dimensional made-up datasets, each with a vastly different optimal decision boundary. In doing so, we will learn more about the characteristics of each classifier and have a better sense of the kinds of data they might be better suited for.

The four datasets are depicted in Figure 9.11:

Figure 9.11: A plot depicting the class patterns of our four illustrative and contrived datasets

The vertical decision boundary

The first contrived dataset we will be looking at is the one in the top-left panel of Figure 9.11. This is a relatively simple classification problem because, just by visual inspection, you can tell that the optimal decision boundary is a vertical line. Let's see how each of our classifiers fares on this dataset:

Figure 9.12: A plot of the decision boundaries of our four classifiers on our first contrived dataset

As you can see, all of our classifiers performed well on this simple dataset; all of the methods find an appropriate straight vertical line that is most representative of the class division. In general, logistic regression is great for linear decision boundaries. Decision trees also work well for straight decision boundaries, as long as the boundaries are orthogonal to the axes! Observe the next dataset.

The diagonal decision boundary

The second dataset sports an optimal decision boundary that is a diagonal line: one that is not orthogonal to the axes. Here, we start to see some cool behavior from certain classifiers.

Figure 9.13: A plot of the decision boundaries of our four classifiers on our second contrived dataset

Though all four classifiers were reasonably effective in this dataset's classification, we start to see each of the classifiers' personalities come out. First, the k-NN creates a boundary that closely approximates the optimal one. The logistic regression, amazingly, throws a perfectly linear boundary at the exact right spot.
Thedecisiontree’sboundaryiscurious;itismadeupofperpendicularzig-zags.Though theoptimaldecisionboundaryislinearintheinputspace,thedecisiontreecan’tcapture itsessence.Thisisbecausedecisiontreesonlysplitonafunctionofonevariableatatime. Thus,datasetswithcomplexinteractionsmaynotbethebestonestoattackwitha decisiontree. Finally,therandomforest,beingcomposedofsufficientlyvarieddecisiontrees,isableto capturethespiritoftheoptimalboundary. Thecrescentdecisionboundary Thisthirddataset,depictedinthebottom-leftpanelofFigure9.11,exhibitsaverynonlinearclassificationpattern: Figure9.14:Aplotofthedecisionboundariesofourfourclassifiersonourthird contriveddataset Intheprecedingfigure,ourtopperformersarek-NN—whichishighlyeffectivewithnonlinearboundaries—andrandomforest—whichissimilarlyeffective.Thedecisiontreeisa littletoojaggedtocompeteatthetoplevel.Buttherealloserhereislogisticregression. Becauselogisticregressionreturnslineardecisionboundaries,itisineffectiveat classifyingthesedata. Tobefair,withalittlefinesse,logisticregressioncanhandletheseboundaries,too,as we’llseeinthelastexample.However,inhighlynon-linearsituations,wherethenatureof thenon-linearboundaryisunknown—orunknowable—logisticregressionisoften outperformedbyotherclassifiersthatnativelyhandlethesesituationswithease. Thecirculardecisionboundary Thelastdatasetwewillbelookingat,likethepreviousone,containsanon-linear classificationpattern. Figure9.15:Aplotofthedecisionboundariesofourfourclassifiersonourfourth contriveddataset Again,justlikeinthelastcase,thewinnersarek-NNandrandomforest,followedbythe decisiontreewithitsjaggededges.And,again,thelogisticregressionunproductively throwsalinearboundaryatadistinctivelynot-linearpattern.However,statingthatlogistic regressionisunsuitableforproblemsofthistypeisbothnegligentanddeadwrong. 
With a slight change in the incantation of the logistic regression, the whole game is changed, and logistic regression becomes the clear winner:

  > model <- glm(factor(dep.var) ~ ind.var1 +
  +              I(ind.var1^2) + ind.var2 + I(ind.var2^2),
  +              data=this, family=binomial(logit))

Figure 9.16: A second-order (quadratic) logistic regression decision boundary

In the preceding figure, instead of modeling the binary dependent variable (dep.var) as a linear combination of solely the two independent variables (ind.var1 and ind.var2), we model it as a function of those two variables and those two variables squared. The result is still a linear combination of the inputs (before the inverse link function), but now the inputs contain non-linear transformations of the original inputs. This general technique is called polynomial regression and can be used to create a wide variety of non-linear boundaries. In this example, just squaring the inputs (resulting in a quadratic polynomial) yields a classification circle that exactly matches the optimal decision boundary, as you can see in Figure 9.16. Cubing the original inputs (creating a cubic polynomial) suffices to describe the boundary in the previous example.

In fact, a logistic regression containing polynomial terms of arbitrarily large order can fit any decision boundary, no matter how non-linear and complicated. Careful, though! Using high-order polynomials is a great way to make sure you overfit your data.

My general advice is to use polynomial regression only for cases where you know a priori what polynomial form your boundaries take on, like an ellipse! If you must experiment, keep a close eye on your cross-validated error rate to make sure you are not fooling yourself into thinking that you are doing the right thing by taking on more and more polynomial terms.
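The dataset behind Figure 9.16 isn't printed in the text, so the following self-contained sketch simulates a comparable circular class pattern; the variable names mirror the snippet above, but the data here are invented for illustration. It compares a plain linear logistic regression against the quadratic version:

```r
# Simulate a hypothetical circular class pattern: class 1 inside a circle
# of radius 0.6, class 0 outside (this stands in for the book's dataset).
set.seed(1)
n <- 500
ind.var1 <- runif(n, -1, 1)
ind.var2 <- runif(n, -1, 1)
dep.var  <- as.numeric(ind.var1^2 + ind.var2^2 < 0.6^2)
this <- data.frame(dep.var, ind.var1, ind.var2)

# a plain linear logistic regression versus the quadratic one from the text
linear.model    <- glm(factor(dep.var) ~ ind.var1 + ind.var2,
                       data=this, family=binomial(logit))
quadratic.model <- glm(factor(dep.var) ~ ind.var1 + I(ind.var1^2) +
                                         ind.var2 + I(ind.var2^2),
                       data=this, family=binomial(logit))

# training-set accuracy: proportion of correct 0/1 predictions
accuracy <- function(model){
  preds <- as.numeric(predict(model, type="response") > 0.5)
  mean(preds == this$dep.var)
}

accuracy(linear.model)     # poor: a line cannot separate a circle
accuracy(quadratic.model)  # near perfect on this contrived pattern
```

Because the simulated classes are perfectly separable by the quadratic boundary, glm may warn about fitted probabilities of 0 or 1; that warning is itself a hint that the decision boundary has been captured (a little too) exactly.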
Exercises

Practise the following exercises to get a firm grasp on the concepts learned so far:

- Did you notice that I put CV in italics when I said, "Using k=27 seems like a safe bet as measured by the minimization of CV error"? Did you wonder why? I (quite deliberately) made a gaffe in choosing the k in the k-NN from Figure 9.4. My choice wasn't wrong, per se, but my choice of k may have been informed by data that should have been unavailable to me. How might I have committed a common but serious error in hyper-parameter tuning? How might I have done things differently?
- Remember that we spent a long time talking about the assumptions of linear regression? In contrast, we spent virtually no time discussing the assumptions of logistic regression. Although logistic regression has less stringent assumptions than its cousin, it is not assumption-free. Think about what some assumptions of logistic regression might be. Confirm your suspicions by doing research on the web. My omission of the assumptions was not out of laziness, and (again) it was quite deliberate. As you progress in your career as a data analyst, you will often come across exciting new classification methods that you will, no doubt, want to put to use right away. A trait that will set you apart from your more impulsive colleagues is one that promotes careful examination and independent research into where these techniques could go wrong.
- You may be surprised to learn that all of the classification techniques that we discussed in this chapter can be adapted for use in regression (predicting continuous variables)! The adaptation of logistic regression is obvious, but think about how you might adapt the others for this purpose. Do some research into it.
- To what extent can the rapid dismantling of the New Deal policies after the death of Roosevelt be factored into the concurrent rise of neoliberal economic ideas and policies in post-war American intellectual thought?
Summary

At a high level, in this chapter you learned about four of the most popular classifiers out there: k-Nearest Neighbors, logistic regression, decision trees, and random forests. Not only did you learn the basics and mechanics of these four algorithms, but you saw how easy they are to use in R. Along the way, you learned about confusion matrices, hyper-parameter tuning, and maybe even a few new R incantations.

We also visited some more general ideas; for example, you've expanded your understanding of the bias-variance trade-off, saw how the GLM can perform great feats, and became acquainted with ensemble learning and bootstrap aggregation. It's also my hope that you've developed some intuition as to which classifiers to use in different situations. Finally, given that we couldn't achieve perfect classification on our diabetes dataset, I hope that you've gained an appreciation for the art and difficulty of classification. Perhaps you've even caught the statistical learning bug and want to try to beat our performance in this chapter! That would be great! There are competitions on the web for people just like you, and they are a great way to hone your skills. This, for better or worse, concludes our unit on predictive analytics. In the final unit, we will be discussing some of the trials and tribulations of data analysis as it tends to go in practice. Stay tuned!

Chapter 10. Sources of Data

The previous two units (Confirmatory Data Analysis and Inferential Statistics, and Predictive Analytics) focused on teaching both theory and practice in ideal data scenarios, so that our more academic quests could be divorced from outside concerns about the veracity or format of the data. To this end, we deliberately stayed away from datasets not already built into R or available from add-on packages. But very few people I know get by in their careers using R without importing any data from sources outside R packages.

Well, we very briefly touched upon how to load data into R (the read.* commands) in the very first chapter of this book, did we not? So we should be all set, right?
Here’stherub:IknowalmostasfewpeoplethatcangetbyusingsimpleCSVsandtabdelimitedtextlocallywiththeprimaryread.*commandsascangetbynotusingoutside sourcesofdataatall!Theunfortunatefactisthatmanyintroductoryanalyticstextslargely disregardthisreality.Thisproducesmanywell-informednewanalystswhoare neverthelessstymiedontheirfirstattempttoapplytheirfreshknowledgeto“real-world data”.Inmyopinion,anytextthatpurportstobeapracticalresourcefordataanalysts cannotaffordtoignorethis. Luckily,duetolargelyundirectedandunplannedpersonalresearchIdoforblogpostsand myownedificationusingmotleycollectionsofpubliclyavailabledatasourcesinvarious formats,I—perhapsdelusionally—considermyselffairlyadeptinnavigatingthis criminallyoverlookedportionofpracticalanalytics.ItisthebodyoflessonsI’velearned duringthesewilddataadventuresthatI’dliketoimparttoyouinthisandthesubsequent chapter,dearreader. It’scommonfordatasourcestobenotonlydifficulttoload,forvariousreasons,buttobe difficulttoworkwithbecauseoferrors,junk,orjustgeneralidiosyncrasies.Becauseof this,thischapterandthenextchapter,Dealingwithmessydatawillhavealotincommon. Thischapterwillconcentratemoreongettingdatafromoutsidesourcesandgettingitinto asomewhatusableforminR.Thenextchapterwilldiscussparticularlycommongotchas whileworkingwithdatainanimperfectworld. Iappreciatethatnoteveryonehastheinterestorthetime-availabilitytogoonwildgoose huntsforpubliclyavailabledatatoanswerquestionsformedonawhim.Nevertheless,the techniquesthatwe’llbediscussinginthischaptershouldbeveryhelpfulinhandlingthe varietyofdataformatsthatyou’llhavetocontendwithinthecourseofyourworkor research.Additionally,havingthewherewithaltoemployfreelyavailabledataontheweb canbeindispensableforlearningnewanalyticsmethodsandtechnologies. Thefirstsourceofdatawe’llbelookingatisthatofthevenerablerelationaldatabase. 
Relational Databases

Perhaps the most common external source of data is a relational database. Since this section is probably of interest only to those who work with databases, or at least plan to, some knowledge of the basics of relational databases is assumed.

One way to connect to databases from R is to use the RODBC package. This allows one to access any database that implements the ODBC common interface (for example, PostgreSQL, Access, Oracle, SQLite, DB2, and so on). A more common method, for whatever reason, is to use the DBI package and DBI-compliant drivers.

DBI is an R package that defines a generalized interface for communication between different databases and R. As with ODBC, it allows the same compliant SQL to run on multiple databases. The DBI package alone is not sufficient for communicating with any particular database from R; in order to use DBI, you must also install and load a DBI-compliant driver for your particular database. Packages exist providing drivers for many RDBMSs. Among them are RPostgreSQL, RSQLite, RMySQL, and ROracle.

In order to most easily demonstrate R/DB communication, we will be using a SQLite database. This will also most easily allow the prudent reader to create the example database and follow along. The SQL we'll be using is standard, so you can really use any DB you want, anyhow.

Our example database has two tables: artists and paintings. The artists table contains a unique integer ID, an artist's name, and the year they were born. The paintings table contains a unique integer ID, an artist ID, the name of the painting, and its completion date. The artist ID in the paintings table is a foreign key that references the artist ID in the artists table; this is how this database links paintings to their respective painters.

If you want to follow along, use the following SQL statements to create and populate the database. If you're using SQLite, name the database art.db.
CREATE TABLE artists(
  artist_id INTEGER PRIMARY KEY,
  name TEXT,
  born_on INTEGER
);

CREATE TABLE paintings(
  painting_id INTEGER PRIMARY KEY,
  painting_artist INTEGER,
  painting_name TEXT,
  year_completed INTEGER,
  FOREIGN KEY(painting_artist) REFERENCES artists(artist_id)
);

INSERT INTO artists(name, born_on)
VALUES ("Kay Sage", 1898),
       ("Piet Mondrian", 1872),
       ("Rene Magritte", 1898),
       ("Man Ray", 1890),
       ("Jean-Michel Basquiat", 1960);

INSERT INTO paintings(painting_artist, painting_name, year_completed)
VALUES (4, "Orquesta Sinfonica", 1916),
       (4, "La Fortune", 1938),
       (1, "Tommorow is Never", 1955),
       (1, "The Answer is No", 1958),
       (1, "No Passing", 1954),
       (5, "Bird on Money", 1981),
       (2, "Place de la Concorde", 1943),
       (2, "Composition No. 10", 1942),
       (3, "The Human Condition", 1935),
       (3, "The Treachery of Images", 1948),
       (3, "The Son of Man", 1964);

Confirm for yourself that the following SQL commands yield the appropriate results by typing them into the sqlite3 command-line interface.

SELECT * FROM artists;
--------------------------------
1|Kay Sage|1898
2|Piet Mondrian|1872
3|Rene Magritte|1898
4|Man Ray|1890
5|Jean-Michel Basquiat|1960

SELECT * FROM paintings;
--------------------------------------
1|4|Orquesta Sinfonica|1916
2|4|La Fortune|1938
3|1|Tommorow is Never|1955
4|1|The Answer is No|1958
5|1|No Passing|1954
6|5|Bird on Money|1981
7|2|Place de la Concorde|1943
8|2|Composition No. 10|1942
9|3|The Human Condition|1935
10|3|The Treachery of Images|1948
11|3|The Son of Man|1964

For our first act, we load the necessary packages, choose our database driver, and connect to the database:

library(DBI)
library(RSQLite)

sqlite <- dbDriver("SQLite")

# we read the art sqlite db from the current
# working directory, which can be got and set
# with getwd() and setwd(), respectively
art_db <- dbConnect(sqlite, "./art.db")

Again, we are using SQLite for this example, but this procedure is applicable to all DBI-compliant database drivers.
Let’snowrunaqueryagainstthisdatabase.Let’sgetalistofallthepaintingnamesand theirrespectiveartist’sname.Thiswillrequireajoinoperationbetweenthetwotables: result<-dbSendQuery(art_db, "SELECTpaintings.painting_name,artists.name FROMpaintingsINNERJOINartists ONpaintings.painting_artist=artists.artist_id;") response<-fetch(result) head(response) dbClearResult(result) ---------------------------------------------painting_namename 1OrquestaSinfonicaManRay 2LaFortuneManRay 3TommorowisNeverKaySage 4TheAnswerisNoKaySage 5NoPassingKaySage HereweusedthedbSendQueryfunctiontosendaquerytothedatabase.Itsfirstand secondargumentswerethedatabasehandlevariable(fromthedbConnectfunction)and theSQLstatement,respectively.Westoreahandletotheresultinavariable.Next,the fetchfunctionretrievestheresponsefromthehandle.Bydefault,itwillretrieveall matchesfromthequery,thoughthiscanbelimitedbyspecifyingthenargument(see help("fetch")).Theresultofthefetchisthenstoredinthevariableresponse.response isanRdataframelikeanyother;wecandoanyoftheoperationswe’vealreadylearned withit.Finally,wecleartheresult,whichisgoodpractice,becauseitfreesresources. Foraslightlymoreinvolvedquery,let’strytofindtheaverage(mean)ageoftheartistsat theagetheywerewheneachofthepaintingswerecompleted.Thisstillrequiresajoin, butthistimeweareselectingpaintings.year_completedandartists.born_on. result<-dbSendQuery(art_db, "SELECTpaintings.year_completed,artists.born_on FROMpaintingsINNERJOINartists ONpaintings.painting_artist=artists.artist_id;") response<-fetch(result) head(response) dbClearResult(result) ---------------------------year_completedborn_on 119161890 219381890 319551898 419581898 519541898 619811960 Atthistime,row-wisesubtractionandaveragingcanbeperformedsimply: mean(response$year_completed-response$born_on) ----------[1]51.091 Finally,wecloseourconnectiontothedatabase: dbDisconnect(art_db) Whydidn’twejustdothatinSQL? 
Why, indeed. Although this very simple example could easily have been written into the logic of the SQL query, for more complicated data analysis this simply won't cut it. Unless you are using a really specialized database, many databases aren't prepared for certain mathematical functions with regard to numerical accuracy. More importantly, most databases don't implement advanced math functions at all. Even if they did, they almost certainly wouldn't be portable between different RDBMSs. There is great merit in having analytics logic reside in R so that if, for whatever reason, you have to switch databases, your analysis code will remain unchanged.

Note

If SQL is your cup of tea, did you know you can use the sqldf package to perform arbitrary SQL queries on data.frames?

There is a rising interest in, and (to a lesser extent) need for, databases that don't adhere to the relational paradigm. These so-called NoSQL databases include the immensely popular Hadoop/HDFS, MongoDB, CouchDB, Neo4j, and Redis, among many others. There are R packages for communicating with most of these, too, including one for every one of the databases mentioned here by name. Since the operation of all of these packages is idiosyncratic and heavily dependent on which species of NoSQL the database in question belongs to, your best bet for learning how to use them is to read the help pages and/or vignettes for each package.

Using JSON

JavaScript Object Notation (JSON) is a standardized human-readable data format that plays an enormous role in communication between web browsers and web servers. JSON was originally born out of a need to represent arbitrarily complex data structures in JavaScript, a web scripting language, but it has since grown into a language-agnostic data serialization format.

It is a common need to import and parse JSON in R, particularly when working with web data. For example, it is very common for websites to offer web services that take an arbitrary query from a web browser and return the response as JSON. We will see an example of this very use case later in this section.
For our first look into JSON parsing in R, we'll use the jsonlite package to read a small JSON string, which serializes some information about the best musical act in history, The Beatles:

library(jsonlite)

example.json <- '
{
  "thebeatles": {
    "formed": 1960,
    "members": [
      {
        "firstname": "George",
        "lastname": "Harrison"
      },
      {
        "firstname": "Ringo",
        "lastname": "Starr"
      },
      {
        "firstname": "Paul",
        "lastname": "McCartney"
      },
      {
        "firstname": "John",
        "lastname": "Lennon"
      }
    ]
  }
}'

the_beatles <- fromJSON(example.json)
print(the_beatles)
---------------------
$thebeatles
$thebeatles$formed
[1] 1960

$thebeatles$members
  firstname  lastname
1    George  Harrison
2     Ringo     Starr
3      Paul McCartney
4      John    Lennon

We used the fromJSON function to read in the string. The result is an R list, whose elements/attributes can be accessed via the $ operator, or the [[ (double square bracket) function/operator. For example, we can access the date when The Beatles formed, in R, in the following two ways:

the_beatles$thebeatles$formed
the_beatles[["thebeatles"]][["formed"]]
---------
[1] 1960
[1] 1960

Note

In R, a list is a data structure that is kind of like a vector, but allows elements of differing data types. A single list may contain numerics, strings, vectors, or even other lists!

Now that we have the very basics of handling JSON down, let's move on to using it in a non-trivial manner!

There's a music/social-media platform called Last.fm (http://www.last.fm) that kindly provides a web service API that's free for public use (as long as you abide by their reasonable terms). This API (Application Programming Interface) allows us to query various points of data about musical artists by crafting special URLs. The results of following these URLs are either a JSON or XML payload, which are directly consumable from R.
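As the note above mentions, the fromJSON result is just an ordinary nested R list. A base-R stand-in (no jsonlite required, with the structure typed out by hand) illustrates that $ and [[ are interchangeable, including when drilling into nested structure:

```r
# Hand-built nested list mirroring the parsed Beatles JSON above
the_beatles <- list(
  thebeatles = list(
    formed  = 1960,
    members = data.frame(
      firstname = c("George", "Ringo", "Paul", "John"),
      lastname  = c("Harrison", "Starr", "McCartney", "Lennon"))
  )
)

the_beatles$thebeatles$formed                 # $ access
the_beatles[["thebeatles"]][["formed"]]       # equivalent [[ access
the_beatles$thebeatles$members$firstname[4]   # drilling into the data frame
```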
In this non-trivial example of using web data, we will be building a rudimentary recommendation system. Our system will allow us to suggest new music to a particular person based on an artist that they already like. In order to do this, we have to query the Last.fm API to gather all the tags associated with particular artists. These tags function a lot like genre classifications. The success of our recommendation system will be predicated on the assumption that musical artists with overlapping tags are more similar to each other than artists with disparate tags, and that someone is more likely to enjoy a similar artist than an arbitrary dissimilar artist.

Here's an example JSON excerpt of the result of querying the API for the tags of a particular artist:

{
  "toptags": {
    "tag": [
      {
        "count": 100,
        "name": "female vocalists",
        "url": "http://www.last.fm/tag/female+vocalists"
      },
      {
        "count": 71,
        "name": "singer-songwriter",
        "url": "http://www.last.fm/tag/singer-songwriter"
      },
      {
        "count": 65,
        "name": "pop",
        "url": "http://www.last.fm/tag/pop"
      }
    ]
  }
}

Here, we only care about the name of each tag, not the URL or the count of occasions Last.fm users applied the tag to the artist.

Let's first create a function that will construct the properly formatted query URL for a particular artist. The Last.fm developer website indicates that the format is:

http://ws.audioscrobbler.com/2.0/?method=artist.gettoptags&artist=<THE_ARTIST>&api_key=c2e57923a25c03f3d8b317b3c8622b43&format=json

In order to create these URLs based upon arbitrary input, we can use the paste0 function to concatenate the component strings. However, URLs can't handle certain characters, such as spaces; in order to convert the artist's name to a format suitable for a URL, we'll use the URLencode function from the (preloaded) utils package.

URLencode("The Beatles")
------
[1] "The%20Beatles"

Now we have all the pieces to put this function together:

create_artist_query_url_lfm <- function(artist_name){
  prefix <- "http://ws.audioscrobbler.com/2.0/?method=artist.gettoptags&artist="
  postfix <- "&api_key=c2e57923a25c03f3d8b317b3c8622b43&format=json"
  encoded_artist <- URLencode(artist_name)
  return(paste0(prefix, encoded_artist, postfix))
}

create_artist_query_url_lfm("Depeche Mode")
-------------------
[1] "http://ws.audioscrobbler.com/2.0/?method=artist.gettoptags&artist=Depeche%20Mode&api_key=c2e57923a25c03f3d8b317b3c8622b43&format=json"

Fantastic! Now we make the web request and parse the resulting JSON. Luckily, the fromJSON function that we've been using can take a URL and automatically make the web request for us. Let's see what it looks like:

fromJSON(create_artist_query_url_lfm("Depeche Mode"))
-----------------------------------------
$toptags
$toptags$tag
  count       name                               url
1   100 electronic http://www.last.fm/tag/electronic
2    87   new wave   http://www.last.fm/tag/new+wave
3    59        80s        http://www.last.fm/tag/80s
4    56  synth pop  http://www.last.fm/tag/synth+pop
........

Neat-o! If you take a close look at the structure, you'll see that the tag names are stored in the name attribute of the tag attribute of the toptags attribute (whew!). This means we can extract just the tag names with $toptags$tag$name. Let's write a function that will take an artist's name and return the tags in a vector.

get_tag_vector_lfm <- function(an_artist){
  artist_url <- create_artist_query_url_lfm(an_artist)
  json <- fromJSON(artist_url)
  return(json$toptags$tag$name)
}

get_tag_vector_lfm("Depeche Mode")
------------------------------------------
[1] "electronic"  "new wave"    "80s"
[4] "synthpop"    "synth pop"   "seen live"
[7] "alternative" "rock"        "british"
........
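One caveat the text glosses over: real web services time out or return errors, so it can pay to wrap any tag-fetching function in tryCatch so that one failed lookup doesn't abort a whole batch of requests. The wrapper below is a hedged sketch of my own, not part of the book's code or the Last.fm API; safely_get_tags and failing_getter are hypothetical names, and getter stands for any function like get_tag_vector_lfm:

```r
# Wrap an arbitrary tag-fetching function so failures yield an empty
# tag vector (plus a warning) instead of a hard error.
safely_get_tags <- function(an_artist, getter){
  tryCatch(getter(an_artist),
           error = function(e){
             warning("lookup failed for ", an_artist, ": ",
                     conditionMessage(e))
             character(0)   # fall back to "no tags known"
           })
}

# demonstration with a stand-in getter that always fails
failing_getter <- function(artist) stop("503 Service Unavailable")
tags <- suppressWarnings(safely_get_tags("Depeche Mode", failing_getter))
length(tags)   # 0, instead of an aborted loop
```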
Next, we have to go ahead and retrieve the tags for all artists. Instead of doing this for every artist in existence (and probably violating Last.fm's terms of service), we'll just pretend that there are only six musical artists in the world. We'll store all of these artists in a list. This will make it easy to use the lapply function to apply the get_tag_vector_lfm function to each artist in the list. Finally, we'll name all the elements in the list appropriately:

our_artists <- list("Kate Bush", "Peter Tosh", "Radiohead",
                    "The Smiths", "The Cure", "Black Uhuru")

our_artists_tags <- lapply(our_artists, get_tag_vector_lfm)
names(our_artists_tags) <- our_artists

print(our_artists_tags)
--------------------------------------
$`Kate Bush`
[1] "female vocalists"  "singer-songwriter" "pop"
[4] "alternative"       "80s"               "british"
........
$`Peter Tosh`
[1] "reggae"       "roots reggae" "Rasta"
[4] "roots"        "ska"          "jamaican"
........
$Radiohead
[1] "alternative"      "alternative rock"
[3] "rock"             "indie"
........
$`The Smiths`
[1] "indie"       "80s"         "post-punk"
[4] "new wave"    "alternative" "rock"
........
$`The Cure`
[1] "post-punk"   "new wave"    "alternative"
[4] "80s"         "rock"        "seen live"
........
$`Black Uhuru`
[1] "reggae"       "roots reggae" "dub"
[4] "jamaica"      "roots"        "jamaican"
........

Now that we have all the artists' tags stored as a list of vectors, we need some way of comparing the tag lists and judging them for similarity.

The first idea that may come to mind is to count the number of tags each pair of artists has in common. Though this may seem like a good idea at first glance, consider the following scenario:

Artist A and artist B have hundreds of tags each, and they share three tags in common; artist C and artist D each have two tags, both of which are mutually shared. Our naive metric for similarity suggests that artists A and B are more similar than C and D (by 50%). If your intuition tells you that C and D are more similar, though, we are both in agreement.

To make our similarity measure comport more with our intuition, we will instead use the Jaccard index. The Jaccard index (also Jaccard coefficient) between sets A and B, J(A, B), is given by:

J(A, B) = |A ∩ B| / |A ∪ B|

where A ∩ B is the set intersection (the common tags), A ∪ B is the set union (an unduplicated list of all the tags in both sets), and |X| is the set X's cardinality (the number of elements in that set).

This metric has the attractive property that it is naturally constrained: 0 ≤ J(A, B) ≤ 1.

Let's write a function that takes two sets and returns the Jaccard index. We'll employ the built-in functions intersect and union.

jaccard_index <- function(one, two){
  length(intersect(one, two)) / length(union(one, two))
}

Let's try it on The Cure and Radiohead:

jaccard_index(our_artists_tags[["Radiohead"]],
              our_artists_tags[["The Cure"]])
---------------
[1] 0.3333

Neat! Manual checking confirms that this is the right answer.

The next step is to construct a similarity matrix. This is an n-by-n matrix (where n is the number of artists) that depicts all the pairwise similarity measurements. If this explanation is confusing, look at the code output before reading the following code snippet:

similarity_matrix <- function(artist_list, similarity_fn){
  num <- length(artist_list)
  # initialize a num by num matrix of zeroes
  sim_matrix <- matrix(0, ncol=num, nrow=num)
  # name the rows and columns for easy lookup
  rownames(sim_matrix) <- names(artist_list)
  colnames(sim_matrix) <- names(artist_list)
  # for each row in the matrix
  for(i in 1:nrow(sim_matrix)){
    # and each column
    for(j in 1:ncol(sim_matrix)){
      # calculate that pair's similarity
      the_index <- similarity_fn(artist_list[[i]],
                                 artist_list[[j]])
      # and store it in the right place in the matrix
      sim_matrix[i, j] <- round(the_index, 2)
    }
  }
  return(sim_matrix)
}

sim_matrix <- similarity_matrix(our_artists_tags, jaccard_index)
print(sim_matrix)
--------------------------------------------------------------
            Kate Bush Peter Tosh Radiohead The Smiths The Cure Black Uhuru
Kate Bush        1.00       0.05      0.31       0.25     0.21        0.04
Peter Tosh       0.05       1.00      0.02       0.03     0.03        0.33
Radiohead        0.31       0.02      1.00       0.31     0.33        0.04
The Smiths       0.25       0.03      0.31       1.00     0.44        0.05
The Cure         0.21       0.03      0.33       0.44     1.00        0.05
Black Uhuru      0.04       0.33      0.04       0.05     0.05        1.00

If you're familiar with some of these bands, you'll no doubt see that the similarity matrix in the preceding output makes a lot of prima facie sense; it looks like our theory is sound!
If you notice, the values along the diagonal (from the upper-left to the lower-right) are all 1. This is because the Jaccard index of two identical sets is always 1, and an artist's similarity with themselves is always 1. Additionally, all the values are symmetric with respect to the diagonal; whether you look up Peter Tosh and Radiohead by column and then row, or vice versa, the value will be the same (0.02). This property means that the matrix is symmetric. This is a property of all similarity matrices using symmetric (commutative) similarity functions.

Note

A similar (and perhaps more common) concept is that of a distance matrix (or dissimilarity matrix). The idea is the same, but now higher values refer to more musically distant pairs of artists. Also, the diagonal will be all zeroes, since an artist is the least musically different from themselves. If all the values of a similarity matrix are between 0 and 1 (as is often the case), you can easily make it into a distance matrix by subtracting every element from 1. Subtracting from 1 again will yield the original similarity matrix.

Recommendations can now be furnished, for listeners of one of the bands, by sorting that artist's column in the matrix in descending order; for example, suppose a user likes The Smiths, but is unsure what other bands she should try listening to:

# The Smiths are the fourth column
sim_matrix[order(sim_matrix[,4], decreasing=TRUE), 4]
----------------------------------------------
 The Smiths    The Cure   Radiohead   Kate Bush Black Uhuru  Peter Tosh
       1.00        0.44        0.31        0.25        0.05        0.03

Of course, a recommendation of The Smiths for this user is nonsensical. Going down the list, it looks like a recommendation of The Cure is the safest bet, though Radiohead and Kate Bush may also be fine recommendations. Black Uhuru and Peter Tosh are unsafe bets if all we know about the user is her fondness for The Smiths.

XML

XML, like JSON, is an absolutely ubiquitous format for data transfer over the Internet. In addition to being used on the web, XML is also a popular data format for application configuration files and the like. In fact, newer Microsoft Office documents (with the extension .docx or .xlsx) are stored as XML files.
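The similarity-to-distance conversion from the note above can be sketched in a couple of lines of base R; the 3-by-3 matrix here is a made-up example, not the book's artist matrix:

```r
# A made-up similarity matrix: values in [0, 1], ones on the diagonal
sim <- matrix(c(1.00, 0.44, 0.31,
                0.44, 1.00, 0.23,
                0.31, 0.23, 1.00), nrow=3, byrow=TRUE)

dist_mat <- 1 - sim   # now higher means more dissimilar
diag(dist_mat)        # all zeroes: nothing differs from itself

back <- 1 - dist_mat  # subtracting from 1 again recovers the original
all.equal(back, sim)
```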
Here’swhatoursimpleBeatlesdatasetmaylooklikeinXML: example_xml1<-' <the_beatles> <formed>1960</formed> <members> <member> <first_name>George</first_name> <last_name>Harrison</last_name> </member> <member> <first_name>Ringo</first_name> <last_name>Starr</last_name> </member> <member> <first_name>Paul</first_name> <last_name>McCartney</last_name> </member> <member> <first_name>John</first_name> <last_name>Lennon</last_name> </member> </members> </the_beatles>' MuchlikeJSON,XMLisstoredinatreestructure—thisiscalledaDOM(Document ObjectModel)treeinXMLparlance.EachpieceofinformationinanXMLdocument— surroundedbynamesinanglebrackets—iscalledanelementornode.Inthehierarchical structure,subnodesarecalledchildren.Intheprecedingcode,formedisachildof the_beatles,andmemberisachildofmembers.Eachnodemayhavezeroormore childrenwhomayhavechildrennodesoftheirown.Forexample,themembersnodehas fourchildren,eachofwhomhavetwochildren,first_nameandlast_name.Thecommon parentofalltheelements(whetherdirectparentorgreat-great-grandparent)istheroot node,whichdoesn’thaveaparent. Note AswithJSON,XMLandXMLimportfunctionsisanenormoustopic.We’llonlybriefly coversomeofthemorecommonandbasicknow-howinthischapter.Fortunately,Rhasa built-inhelpanddocumentation.Forthispackage,help(package="XML")indicatesthat moredocumentationisavailableatthepackage’sURL:http://www.omegahat.org/RSXML WewillreadtheprecedingXMLwiththeXMLpackage.Ifyoudon’thaveitalready,make sureyouinstallit. library(XML) the_beatles<-xmlTreeParse(example_xml1) print(names(the_beatles)) ------------------[1]"doc""dtd" print(the_beatles$doc) --------------------$file [1]"<buffer>" $version [1]"1.0" $children $children$the_beatles <the_beatles> <formed>1960</formed> <members> <member> <first_name>George</first_name> <last_name>Harrison</last_name> </member> .......... 
</members> </the_beatles> attr(,"class") [1]"XMLDocumentContent" xmlTreeParsereadsandparsestheDOM,andstoresitasanRlist.Theactualcontentis storedinthechildrenattributeofthedocattribute.WecanaccesstheyearTheBeatles wereformedlikeso: print(xmlValue(the_beatles$doc$children$the_beatles[["formed"]])) ---------------------[1]"1960" Here,weusethexmlValuefunctiontoextractthevaluestoredintheformednode. Ifwewantedtogettothefirstnamesofallthemembers,wehavetostoretherootnodeof theDOM,anditerateoverthechildrenofthemembersnode.Inparticular,weusethe sapplyfunction(whichappliesafunctiontoeachelementofavector)overthechildren withafunctionthatreturnsthexmlvalueofthefirst_namenode.Concretely: root<-xmlRoot(the_beatles) sapply(xmlChildren(root[["members"]]),function(x){ xmlValue(x[["first_name"]]) }) ------------------------------------------membermembermembermember "George""Ringo""Paul""John" Thoughit’spossibletoworkwiththeDOMinthismanner,itismuchmorecommonto interrogateXMLusingXPath. XPathiskindoflikeanXMLquerylanguage—likeSQL,butforXML.Itallowsusto selectnodesthatmatchaparticularpatternorlocation.Formatching,itusespath expressionsthatidentifynodesbasedontheirname,location,orrelationshipswithother nodes. Thispowerfultoolalsocomeswithaproportionallysteeplearningcurve.Luckily,itis somewhateasytogetstarted.Inaddition,therearealotofgreattutorialsonline.The excellenttutorialthattaughtmeXPathisavailableat http://www.w3schools.com/xsl/xpath_intro.asp. TouseXPath,wehavetore-importtheXMLusingthexmlParse(notXMLTreeParse) function,whichusesadifferentoptimizedinternalrepresentation.Toreplicatetheresults ofthepreviouscodesnippetusingXPath,wearegoingtousethefollowingXPath statement: all_first_names<-"//member/first_name" Theprecedingstatementroughlytranslatesto“forallmembernodesanywhereoccurring anywhereinthedocument,getthechildnodenamedfirst_name“. 
the_beatles <- xmlParse(example_xml1)

getNodeSet(the_beatles, all_first_names)
--------
[[1]]
<first_name>George</first_name>

[[2]]
<first_name>Ringo</first_name>

[[3]]
<first_name>Paul</first_name>

[[4]]
<first_name>John</first_name>

attr(,"class")
[1] "XMLNodeSet"

Equivalent XPath expressions could also be written thus:

getNodeSet(the_beatles, "//first_name")
getNodeSet(the_beatles, "/the_beatles/members/member/first_name")

And just the XML values for each node can be extracted thus:

sapply(getNodeSet(the_beatles, all_first_names), xmlValue)
------------------------------
[1] "George" "Ringo" "Paul" "John"

There is more than one way to represent the same information in XML. The following XML is another way of representing the same data about The Beatles. This uses XML attributes instead of nodes for formed, first_name, and last_name:

example_xml2 <- '
<the_beatles formed="1960">
  <members>
    <member first_name="George" last_name="Harrison" />
    <member first_name="Richard" last_name="Starkey" />
    <member first_name="Paul" last_name="McCartney" />
    <member first_name="John" last_name="Lennon" />
  </members>
</the_beatles>'

In this case, after re-parsing with the_beatles <- xmlParse(example_xml2), retrieving a vector of all first names can be done using this snippet:

sapply(getNodeSet(the_beatles, "//member[@first_name]"),
       function(x){ xmlAttrs(x)[["first_name"]] })
----------
[1] "George" "Richard" "Paul" "John"

It may help your understanding of XML processing in R to use it in a real-life example.

There is a repository of music information called MusicBrainz (http://musicbrainz.org). Like Last.fm, this website kindly allows custom queries against their info database, and returns the results in XML format.

We will use this service to extend the recommendation system that we created using just tags from Last.fm, by combining them with tags from MusicBrainz.

To query the database for a particular artist, the format is as follows:

http://musicbrainz.org/ws/2/artist/?query=artist:<THE_ARTIST>

For example, the query for Kate Bush is: http://musicbrainz.org/ws/2/artist/?query=artist:Kate%20Bush

If you visit that link, you'll see that it returns an XML document that contains a list of artists that match the search to varying degrees. The list contains, among others, John Bush, Shelly Bush, and Bush. Luckily, the matches are in order of descending matchiness and, for all the artists that we'll be working with, the correct artist is the first artist in the artist-list node.

In case you can't view the link yourself, the following is essentially its structure:

<metadata xmlns="http://musicbrainz.org/ns/mmd-2.0#">
  <artist-list>
    <artist>
      <name>Kate Bush</name>
      <tag-list>
        <tag count="1">
          <name>kent</name>
        </tag>
        <tag count="1">
          <name>english</name>
        </tag>
        <tag count="3">
          <name>british</name>
        </tag>
      </tag-list>
    </artist>
  </artist-list>
</metadata>

This means that the XPath expression that selects all the tags (of the first artist) is given by: //artist[1]/tag-list/tag/name

As with JSON/Last.fm, let's write the function that, for any given artist, returns the appropriate query URL:

create_artist_query_url_mb <- function(artist){
  encoded_artist <- URLencode(artist)
  return(paste0("http://musicbrainz.org/ws/2/artist/?query=artist:",
                encoded_artist))
}

create_artist_query_url_mb("Depeche Mode")
------
[1] "http://musicbrainz.org/ws/2/artist/?query=artist:Depeche%20Mode"

Now, let's write the function that returns the list of tags for a particular artist.

Because nothing is ever easy, the XPath mentioned in the preceding code will not work as is. This is because the MusicBrainz XML uses an XML namespace. Though it makes our job (marginally) harder, an XML namespace is generally a good thing, because it eliminates ambiguity when referring to element names between different XML documents whose element names are arbitrarily defined by the developer.

As the response suggests, the namespace is given by http://musicbrainz.org/ns/mmd-2.0#. In order to use this in our tag extraction function and XPath selecting, we need to store and name this namespace first:

ns <- "http://musicbrainz.org/ns/mmd-2.0#"
names(ns)[1] <- "ns"

Now we have all we need to write the MusicBrainz counterpart to the get_tag_vector_lfm function.
get_tag_vector_mb <- function(an_artist, ns){
  artist_url <- create_artist_query_url_mb(an_artist)
  the_xml <- xmlParse(artist_url)
  xpath <- "//ns:artist[1]/ns:tag-list/ns:tag/ns:name"
  the_nodes <- getNodeSet(the_xml, xpath, ns)
  return(unlist(lapply(the_nodes, xmlValue)))
}

get_tag_vector_mb("Depeche Mode", ns)
------------------------------------
[1] "electronica"       "post punk"         "alternative dance"
[4] "electronic"        "dark wave"         "britannique"
............

Like fromJSON, xmlParse handles URLs natively.

Let's finish this up:

our_artists <- list("Kate Bush", "Peter Tosh", "Radiohead",
                    "The Smiths", "The Cure", "Black Uhuru")

our_artists_tags_mb <- lapply(our_artists, get_tag_vector_mb, ns)
names(our_artists_tags_mb) <- our_artists

sim_matrix <- similarity_matrix(our_artists_tags_mb, jaccard_index)
print(sim_matrix)
-------
            Kate Bush Peter Tosh Radiohead The Smiths The Cure Black Uhuru
Kate Bush        1.00       0.00      0.24       0.27     0.24        0.00
Peter Tosh       0.00       1.00      0.00       0.00     0.00        0.17
Radiohead        0.24       0.00      1.00       0.23     0.23        0.00
The Smiths       0.27       0.00      0.23       1.00     0.38        0.00
The Cure         0.24       0.00      0.23       0.38     1.00        0.00
Black Uhuru      0.00       0.17      0.00       0.00     0.00        1.00

> sim_matrix[order(sim_matrix[,4], decreasing=TRUE), 4]
------------------------------
 The Smiths    The Cure   Kate Bush   Radiohead  Peter Tosh Black Uhuru
       1.00        0.38        0.27        0.23        0.00        0.00

This yields results that are quite similar to those of the recommendation system that uses tags from only Last.fm. Personally, I like the former better, but how about we combine both? We can do this easily by taking the set union of each artist's tags from the two services.
for(i in 1:length(our_artists_tags)){
  the_artist <- names(our_artists_tags)[i]
  # the_artist now holds the current artist's name
  combined_tags <- union(our_artists_tags[[the_artist]],
                         our_artists_tags_mb[[the_artist]])
  our_artists_tags[[the_artist]] <- combined_tags
}

sim_matrix <- similarity_matrix(our_artists_tags, jaccard_index)
print(sim_matrix)
-------
            Kate Bush Peter Tosh Radiohead The Smiths The Cure Black Uhuru
Kate Bush        1.00       0.04      0.29       0.24     0.19        0.03
Peter Tosh       0.04       1.00      0.01       0.03     0.03        0.29
Radiohead        0.29       0.01      1.00       0.29     0.30        0.03
The Smiths       0.24       0.03      0.29       1.00     0.40        0.05
The Cure         0.19       0.03      0.30       0.40     1.00        0.05
Black Uhuru      0.03       0.29      0.03       0.05     0.05        1.00

Super!

Other data formats

One of the things that makes R great is the wealth of high-quality add-on packages. As you might expect, many of these add-on packages have the ability to import data in a multitude of other formats. Whether it's an arcane markup language, a proprietary binary file, an Excel spreadsheet, and so on, there is almost certainly an R package out there for you to handle it. But how do you find them?

One way is to browse the community-maintained CRAN Task Views (https://cran.r-project.org/web/views/). A task view is a way to browse for packages related to a particular topic, domain, or special interest. The germane Task View, here, is the Web Technologies Task View (https://cran.r-project.org/web/views/WebTechnologies.html). You'll notice that jsonlite and the XML package are mentioned on the first page.

The easiest way to discover these packages, though, is through your favorite web browser. For example, if you are looking for a package to import YAML data (yet another data serialization format), you might search "R CRAN package yaml". If you use a search engine that tracks you (don't fight the singularity), eventually a search of only "R yaml" will suffice to get you where you need to go.

Developing fast and reliable information retrieval skills (like search-engine-fu) is probably one of the most valuable assets of a statistical programmer—or any programmer, for that matter. Cultivating these skills will serve you well, dear reader.

Online repositories

Look back to the Web Technologies task view we talked about in the previous section.
There are a tremendous number of R packages specifically designed to import data directly from specialized sources on the web. Among these are packages to search for and retrieve the full text of academic articles in the Public Library of Science journals (rplos), search for and download the full text of Wikipedia articles (WikipediR), download data about Berlin from the German government (BerlinData), interface with the Chromosome Counts Database (chromer), download historical financial data (quantmod), and access the information in the PubChem chemistry database (rpubchem).

These examples notwithstanding, given that there are many hundreds of immense repositories of public data, it is far too much to expect the R community to have a package specially built for every single one. Luckily, with the ability to handle many different data formats under our belt, we can just download and import the data from these repositories ourselves. The following are a few of my favorite repositories. Perhaps some of them will have dedicated R packages for handling them by the time you read this.

data.gov: a huge repository of data from the US government in a variety of formats including CSV, XML, and JSON
data.gov.uk: the UK's equivalent repository
data.worldbank.org: a spot for data made available by the World Bank, including data on climate change, poverty, and aid effectiveness
archive.ics.uci.edu/ml/: 333 (at the time of writing) datasets of various lengths and widths for testing statistical learning algorithms
www.cdc.gov/nchs/data_access/ftp_data.htm: some health-related datasets made available by the US Centers for Disease Control

Exercises

Practice the following exercises to revise the concepts learned in this chapter:

How did we waste computation in the similarity_matrix function?
Both the Last.fm and the MusicBrainz APIs have a count value associated with each tag, which can be taken to represent the extent to which the tag applies to the artist.
By ignoring this field, in both cases, we implicitly used a count of 1 for every tag—making well-fitting tags just as important as relatively less well-fitting ones. Rewrite the code to take count into account, and weigh each tag proportionally to its count value. This will be challenging, but it will be invaluable for understanding the material. It will also boost your confidence as an R programmer once you finish. Go you!
How else might you be able to extend and improve upon our ragtag recommender system?
The efficient market hypothesis posits that since the price of a financial instrument reflects all the relevant information about its value at any given time, it is impossible to consistently beat the market. Familiarize yourself with the weak, semi-strong, and strong formulations of this hypothesis. Which, if any, of the camps do you align with? Why? Be specific.

Summary

This chapter began with a discussion of relational databases. You've learned that the DBI package defines a standard interface that various database drivers build upon. You then learned how to query these types of databases, and load the results into R.

Next, you gained an appreciation for JSON and XML (right?!), and how to approach the import of data from these formats. We then put our chops to the test by wielding data provided to us by two different web service APIs.

I stealthily snuck some fancy new R constructs into this chapter. For example, prior to this chapter, we had never explicitly worked with lists before.

Finally, you've learned how to look for information beyond what this chapter can provide, and about some other places where we can get data to play around with.

In the next chapter, we won't be talking about how to load data from different sources—we'll be talking about how to deal with disorderly data that is already loaded.

Chapter 11. Dealing with Messy Data

As mentioned in the last chapter, analyzing data in the real world often requires some know-how outside of the typical introductory data analysis curriculum. For example, rarely do we get a neatly formatted, tidy dataset with no errors, junk, or missing values. Rather, we often get messy, unwieldy datasets.
What makes a dataset messy? Different people in different roles have different ideas about what constitutes messiness. Some regard any data that invalidates the assumptions of the parametric model as messy. Others see messiness in datasets with a grievously imbalanced number of observations in each category of a categorical variable. Some examples of things that I would consider messy are:

Many missing values (NAs)
Misspelled names in categorical variables
Inconsistent data coding
Numbers in the same column being in different units
Mis-recorded data and data entry mistakes
Extreme outliers

Since there are an infinite number of ways that data can be messy, there's simply no chance of enumerating every example and its respective solution. Instead, we are going to talk about two tools that help combat the bulk of the messiness issues that I cited just now.

Analysis with missing data

Missing data is another one of those topics that are largely ignored in most introductory texts. Probably, part of the reason why this is the case is that many myths about analysis with missing data still abound. Additionally, some of the research into cutting-edge techniques is still relatively new. A more legitimate reason for its absence in introductory texts is that most of the more principled methodologies are fairly complicated—mathematically speaking. Nevertheless, the incredible ubiquity of problems related to missing data in real-life data analysis necessitates some broaching of the subject. This section serves as a gentle introduction to the subject and one of the more effective techniques for dealing with it.

A common refrain on the subject is something along the lines of the best way to deal with missing data is not to have any. It's true that missing data is a messy subject, and there are a lot of ways to do it wrong. It's important not to take this advice to the extreme, though. In order to bypass missing data problems, some have disallowed survey participants, for example, from going on without answering all the questions on a form. You can coerce the participants in a longitudinal study not to drop out, too. Don't do this. Not only is it unethical, it is also prodigiously counter-productive; there are treatments for missing data, but there are no treatments for bad data.
The standard treatment for the problem of missing data is to replace the missing data with non-missing values. This process is called imputation. In most cases, the goal of imputation is not to recreate the lost complete dataset but to allow valid statistical estimates or inferences to be drawn from incomplete data. Because of this, the effectiveness of different imputation techniques can't be evaluated by their ability to most accurately recreate the data from a simulated missing dataset; they must, instead, be judged by their ability to support the same statistical inferences as would be drawn from the analysis on the complete data. In this way, filling in the missing data is only a step towards the real goal—the analysis. The imputed dataset is rarely considered the final goal of imputation.

There are many different ways that missing data is dealt with in practice—some are good, some are not so good. Some are okay under certain circumstances, but not okay in others. Some involve missing data deletion, while some involve imputation. We will briefly touch on some of the more common methods. The ultimate goal of this chapter, though, is to get you started on what is often described as the gold standard of imputation techniques: multiple imputation.

Visualizing missing data

In order to demonstrate visualizing the patterns of missing data, we first have to create some missing data. This will also be the same dataset that we perform analysis on later in the chapter. To showcase how to use multiple imputation in a semi-realistic scenario, we are going to create a version of the mtcars dataset with a few missing values.

Okay, let's set the seed (for deterministic randomness), and create a variable to hold our new marred dataset.
set.seed(2)
miss_mtcars <- mtcars

First, we are going to create seven missing values in drat (about 20 percent), five missing values in the mpg column (about 15 percent), five missing values in the cyl column, three missing values in wt (about 10 percent), and three missing values in vs:

some_rows <- sample(1:nrow(miss_mtcars), 7)
miss_mtcars$drat[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 5)
miss_mtcars$mpg[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 5)
miss_mtcars$cyl[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 3)
miss_mtcars$wt[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 3)
miss_mtcars$vs[some_rows] <- NA

Now, we are going to create four missing values in qsec, but only for automatic cars:

only_automatic <- which(miss_mtcars$am==0)
some_rows <- sample(only_automatic, 4)
miss_mtcars$qsec[some_rows] <- NA

Now, let's take a look at the dataset:

> miss_mtcars
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85    NA 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110   NA 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175   NA 3.440 17.02  0  0    3    2
Valiant           18.1  NA 225.0 105   NA 3.460    NA  1  0    3    1

Great, now let's visualize the missingness.

The first way we are going to visualize the pattern of missing data is by using the md.pattern function from the mice package (which is also the package that we are ultimately going to use for imputing our missing data). If you don't have the package already, install it.

> library(mice)
> md.pattern(miss_mtcars)
   disp hp am gear carb wt vs qsec mpg cyl drat
12    1  1  1    1    1  1  1    1   1   1    1  0
 4    1  1  1    1    1  1  1    1   0   1    1  1
 2    1  1  1    1    1  1  1    1   1   0    1  1
 3    1  1  1    1    1  1  1    1   1   1    0  1
 3    1  1  1    1    1  0  1    1   1   1    1  1
 2    1  1  1    1    1  1  1    0   1   1    1  1
 1    1  1  1    1    1  1  1    1   0   1    0  2
 1    1  1  1    1    1  1  1    0   1   0    1  2
 1    1  1  1    1    1  1  0    1   1   0    1  2
 2    1  1  1    1    1  1  0    1   1   1    0  2
 1    1  1  1    1    1  1  1    0   1   0    0  3
      0  0  0    0    0  3  3    4   5   5    7 27

A row-wise missing data pattern refers to the columns that are missing for each row. This function aggregates and counts the number of rows with the same missing data pattern.
This function outputs a binary (0 and 1) matrix. Cells with a 1 represent non-missing data; 0s represent missing data. Since the rows are sorted in an increasing-amount-of-missingness order, the first row always refers to the missing data pattern containing the least amount of missing data.

In this case, the missing data pattern with the least amount of missing data is the pattern containing no missing data at all. Because of this, the first row has all 1s in the columns that are named after the columns in the miss_mtcars dataset. The left-most column is a count of the number of rows that display the missing data pattern, and the right-most column is a count of the number of missing data points in that pattern. The last row contains a count of the number of missing data points in each column.

As you can see, 12 of the rows contain no missing data. The next most common missing data pattern is the one missing just mpg; four rows fit this pattern. There are only six rows that contain more than one missing value. Only one of these rows contains more than two missing values (as shown in the second-to-last row).

As far as datasets with missing data go, this particular one doesn't contain much. It is not uncommon for some datasets to have more than 30 percent of their data missing. This dataset doesn't even hit 3 percent.

Now let's visualize the missing data pattern graphically using the VIM package. You will probably have to install this, too.
library(VIM)
aggr(miss_mtcars, numbers=TRUE)

Figure 11.1: The output of VIM's visual aggregation of missing data. The left plot shows the proportion of missing values for each column. The right plot depicts the prevalence of row-wise missing data patterns, like md.pattern

At a glance, this representation shows us, effortlessly, that the drat column accounts for the highest proportion of missingness, column-wise, followed by mpg, cyl, qsec, vs, and wt. The graphic on the right shows us information similar to that of the output of md.pattern. This representation, though, makes it easier to tell if there is some systematic pattern of missingness. The blue cells represent non-missing data, and the red cells represent missing data. The numbers on the right of the graphic represent the proportion of rows displaying that missing data pattern. 37.5 percent of the rows contain no missing data whatsoever.

Types of missing data

The VIM package allowed us to visualize the missing data patterns. A related term, the missing data mechanism, describes the process that determines each data point's likelihood of being missing. There are three main categories of missing data mechanisms: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Discrimination based on the missing data mechanism is crucial, since it informs us about our options for handling the missingness.

The first mechanism, MCAR, occurs when the data's missingness is unrelated to the data. This would occur, for example, if rows were deleted from a database at random, or if a gust of wind took a random sample of a surveyor's survey forms off into the horizon. The mechanism that governs the missingness of drat, mpg, cyl, wt, and vs is MCAR, because we randomly selected elements to go missing. This mechanism, while being the easiest to work with, is seldom tenable in practice.
MNAR, on the other hand, occurs when a variable's missingness is related to the variable itself. For example, suppose the scale that weighed each car had a capacity of only 3,700 pounds, and because of this, the eight cars that weighed more than that were recorded as NA. This is a classic example of the MNAR mechanism—it is the weight of the observation itself that is the cause of its being missing. Another example would be if, during the course of a trial of an anti-depressant drug, participants who were not being helped by the drug became too depressed to continue with the trial. At the end of the trial, when all the participants' levels of depression are assessed and recorded, there would be missing values for participants whose reason for absence is related to their level of depression.

The last mechanism, missing at random, is somewhat unfortunately named. Contrary to what it may sound like, it means there is a systematic relationship between the missingness of an outcome variable and other observed variables, but not the outcome variable itself. This is probably best explained by the following example.

Suppose that in a survey, there is a question about income level that, in its wording, uses a particular colloquialism. Due to this, a large number of the participants in the survey whose native language is not English couldn't interpret the question, and left it blank. If the survey collected just the name, gender, and income, the missing data mechanism of the question on income would be MNAR. If, however, the questionnaire included a question that asked if the participant spoke English as a first language, then the mechanism would be MAR. The inclusion of the "Is English your first language?" variable means that the missingness of the income question can be completely accounted for. The reason for the moniker missing at random is that when you control for the relationship between the missing variable and the observed variable(s) it is related to (for example, "What is your income?" and "Is English your first language?" respectively), the data are missing at random.
As another example, there is a systematic relationship between the am and qsec variables in our simulated missing dataset: qsecs were missing only for automatic cars. But within the group of automatic cars, the qsec variable is missing at random. Therefore, qsec's mechanism is MAR; controlling for transmission type, qsec is missing at random. Bear in mind, though, that if we removed am from our simulated dataset, qsec would become MNAR.

As mentioned earlier, MCAR is the easiest type to work with because of the complete absence of a systematic relationship in the data's missingness. Many unsophisticated techniques for handling missing data rest on the assumption that the data are MCAR. On the other hand, MNAR data is the hardest to work with, since the properties of the missing data that caused its missingness have to be understood quantifiably, and included in the imputation model. Though multiple imputation can handle MNAR mechanisms, the procedures involved become more complicated and are far beyond the scope of this text. The MCAR and MAR mechanisms allow us not to worry about the properties and parameters of the missing data. For this reason, you may sometimes find MCAR or MAR missingness being referred to as ignorable missingness.

MAR data is not as hard to work with as MNAR data, but it is not as forgiving as MCAR. For this reason, though our simulated dataset contains MCAR and MAR components, the mechanism that describes the whole data is MAR—just one MAR mechanism makes the whole dataset MAR.

So which one is it?

You may have noticed that the place of a particular dataset in the missing data mechanism taxonomy is dependent on the variables that it includes. For example, we know that the mechanism behind qsec is MAR, but if the dataset did not include am, it would be MNAR. Since we are the ones who created the data, we know the procedure that resulted in qsec's missing values. If we weren't the ones creating the data—as happens in the real world—and the dataset did not contain the am column, we would just see a bunch of arbitrarily missing qsec values. This might lead us to believe that the data is MCAR. It isn't, though; just because the variable to which another variable's missingness is systematically related is unobserved, doesn't mean that it doesn't exist.
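The MCAR/MAR distinction can be made concrete with a short, self-contained simulation (this sketch is my own, not the book's code, and the variable names are made up). Under MCAR, the missingness rate is the same in every group of an observed covariate; under MAR, it differs by group but, within a group, which values go missing is random:

```r
set.seed(1)
n <- 1000
group <- rbinom(n, 1, 0.5)            # an observed covariate, like am
y     <- rnorm(n, mean = 20, sd = 2)  # the outcome, like qsec

# MCAR: every element has the same 20 percent chance of going missing
y_mcar <- ifelse(runif(n) < 0.2, NA, y)

# MAR: only group 0 can go missing (and does so at random within that group)
y_mar <- ifelse(group == 0 & runif(n) < 0.4, NA, y)

# missingness rates by group: roughly equal under MCAR, very unequal under MAR
tapply(is.na(y_mcar), group, mean)
tapply(is.na(y_mar), group, mean)
```

Note that if group were dropped from this dataset, the two patterns would look identical on inspection—which is exactly the point made above.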
This raises a critical question: can we ever be sure that our data is not MNAR? The unfortunate answer is no. Since the data that we need to prove or disprove MNAR is ipso facto missing, the MNAR assumption can never be conclusively disconfirmed. It's our job, as critically thinking data analysts, to ask whether there is likely an MNAR mechanism or not.

Unsophisticated methods for dealing with missing data

Here we are going to look at various methods for dealing with missing data:

Complete case analysis

This method, also called list-wise deletion, is a straightforward procedure that simply removes all rows or elements containing missing values prior to the analysis. In the univariate case—taking the mean of the drat column, for example—all elements of drat that are missing would simply be removed:

> mean(miss_mtcars$drat)
[1] NA
> mean(miss_mtcars$drat, na.rm=TRUE)
[1] 3.63

In a multivariate procedure—for example, linear regression predicting mpg from am, wt, and qsec—all rows that have a missing value in any of the columns included in the regression are removed:

listwise_model <- lm(mpg ~ am + wt + qsec,
                     data=miss_mtcars,
                     na.action = na.omit)
## OR
# complete.cases returns a boolean vector
comp <- complete.cases(cbind(miss_mtcars$mpg,
                             miss_mtcars$am,
                             miss_mtcars$wt,
                             miss_mtcars$qsec))
comp_mtcars <- miss_mtcars[comp,]
listwise_model <- lm(mpg ~ am + wt + qsec,
                     data=comp_mtcars)

Under an MCAR mechanism, a complete case analysis produces unbiased estimates of the mean, variance/standard deviation, and regression coefficients, which means that the estimates don't systematically differ from the true values on average, since the included data elements are just a random sampling of the recorded data elements. However, inference-wise, since we lost a number of our samples, we are going to lose statistical power and generate standard errors and confidence intervals that are bigger than they need to be. Additionally, in the multivariate regression case, note that our sample size depends on the variables that we include in the regression; the more variables, the more missing data we open ourselves up to, and the more rows we are liable to lose. This makes comparing results across different models slightly hairy.
Under an MAR or MNAR mechanism, list-wise deletion will produce biased estimates of the mean and variance. For example, if am were highly correlated with qsec, the fact that we are missing qsec only for automatic cars would significantly shift our estimates of the mean of qsec. Surprisingly, list-wise deletion produces unbiased estimates of the regression coefficients, even if the data is MNAR or MAR, as long as the relevant variables are included in the regression equations. For this reason, if there are relatively few missing values in a dataset that is to be used in regression analysis, list-wise deletion could be an acceptable alternative to more principled approaches.

Pairwise deletion

Also called available-case analysis, this technique is (somewhat unfortunately) common when estimating covariance or correlation matrices. For each pair of variables, it only uses the cases that are non-missing for both. This often means that the number of elements used will vary from cell to cell of the covariance/correlation matrices. This can result in absurd correlation coefficients that are above 1, making the resulting matrices largely useless to methodologies that depend on them.

Mean substitution

Mean substitution, as the name suggests, replaces all the missing values with the mean of the available cases. For example:

mean_sub <- miss_mtcars
mean_sub$qsec[is.na(mean_sub$qsec)] <- mean(mean_sub$qsec,
                                            na.rm=TRUE)
# etc…

Although this seemingly solves the problem of the loss of sample size in the list-wise deletion procedure, mean substitution has some very unsavory properties of its own. While mean substitution produces unbiased estimates of the mean of a column, it produces biased estimates of the variance, since it removes the natural variability that would have occurred in the missing values had they not been missing. The variance estimates from mean substitution will therefore be, systematically, too small. Additionally, it's not hard to see that mean substitution will result in biased estimates if the data are MAR or MNAR. For these reasons, mean substitution is not recommended under virtually any circumstance.
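The shrinking-variance problem is easy to demonstrate with a short simulation (a sketch of my own, not the book's code): knock out 30 percent of a variable completely at random, substitute the mean, and watch the standard deviation estimate fall:

```r
set.seed(4)
x <- rnorm(1000, mean = 10, sd = 3)     # complete data

x_miss <- x
x_miss[sample(1000, 300)] <- NA         # 30 percent MCAR missingness

x_sub <- x_miss
x_sub[is.na(x_sub)] <- mean(x_sub, na.rm = TRUE)  # mean substitution

sd(x_miss, na.rm = TRUE)  # complete cases: close to the true value of 3
sd(x_sub)                 # after mean substitution: systematically too small
```

With 30 percent of the values pinned to the mean, the standard deviation estimate shrinks by roughly the square root of the fraction of cases that remain observed.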
Hot deck imputation

Hot deck imputation is an intuitively elegant approach that fills in the missing data with donor values from another row in the dataset. In the least sophisticated formulation, a random non-missing element from the same dataset is shared with a missing value. In more sophisticated hot deck approaches, the donor value comes from a row that is similar to the row with the missing data. The multiple imputation technique that we will be using in a later section of this chapter borrows this idea for one of its imputation methods.

Note

The term hot deck refers to the old practice of storing data in decks of punch cards. The deck that holds the donor value would be hot because it is the one that is currently being processed.

Regression imputation

This approach attempts to fill in the missing data in a column using regression to predict likely values of the missing elements using other columns as predictors. For example, using regression imputation on the drat column would employ a linear regression predicting drat from all the other columns in miss_mtcars. The process would be repeated for all columns containing missing data, until the dataset is complete.

This procedure is intuitively appealing, because it integrates knowledge of the other variables and patterns of the dataset. This creates a set of more informed imputations. As a result, this produces unbiased estimates of the mean and regression coefficients under MCAR and MAR (so long as the relevant variables are included in the regression model).

However, this approach is not without its problems. The predicted values of the missing data lie right on the regression line but, as we know, very few data points lie right on the regression line—there is usually a normally distributed residual (error) term. Due to this, regression imputation underestimates the variability of the missing values. As a result, it will result in biased estimates of the variance and covariance between different columns. However, we're on the right track.
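A toy sketch of single-column regression imputation (my own illustration, not the book's code), using the built-in mtcars data with some drat values knocked out; the choice of wt and hp as predictors here is arbitrary:

```r
set.seed(3)
dat <- mtcars
dat$drat[sample(nrow(dat), 7)] <- NA     # create some missing values

# fit a regression on the complete cases, then fill the NAs with its
# (deterministic) predictions -- every imputed value lies exactly on the
# regression plane, which is what understates the variance
fit <- lm(drat ~ wt + hp, data = dat, na.action = na.omit)
missing_rows <- is.na(dat$drat)
dat$drat[missing_rows] <- predict(fit, newdata = dat[missing_rows, ])
```

The stochastic variant discussed next would add rnorm(sum(missing_rows), 0, sigma(fit)) noise to these predictions to restore the lost variability.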
Stochastic regression imputation

As far as unsophisticated approaches go, stochastic regression is fairly evolved. This approach solves some of the issues of regression imputation, and produces unbiased estimates of the mean, variance, covariance, and regression coefficients under MCAR and MAR. It does this by adding a random (stochastic) value to the predictions of regression imputation. This random added value is sampled from the residual (error) distribution of the linear regression—which, if you remember, is assumed to be a normal distribution. This restores the variability in the missing values (that we lost in regression imputation) that those values would have had if they weren't missing.

However, as far as subsequent analysis and inference on the imputed dataset goes, stochastic regression results in standard errors and confidence intervals that are smaller than they should be. Since it produces only one imputed dataset, it does not capture the extent to which we are uncertain about the residuals and our coefficient estimates. Nevertheless, stochastic regression forms the basis of still more sophisticated imputation methods.

There are two sophisticated, well-founded, and recommended methods of dealing with missing data. One is called the Expectation Maximization (EM) method, which we do not cover here. The second is called Multiple Imputation, and because it is widely considered the most effective method, it is the one we explore in this chapter.

Multiple imputation

The big idea behind multiple imputation is that instead of generating one set of imputed data with our best estimation of the missing data, we generate multiple versions of the imputed data where the imputed values are drawn from a distribution. The uncertainty about what the imputed values should be is reflected in the variation between the multiply imputed datasets.
We perform our intended analysis separately on each of these m completed datasets. These analyses will then yield m different parameter estimates (like regression coefficients, and so on). The critical point is that these parameter estimates are different solely due to the variability in the imputed missing values, and hence, our uncertainty about what the imputed values should be. This is how multiple imputation integrates uncertainty, and outperforms more limited imputation methods that produce one imputed dataset, conferring an unwarranted sense of confidence in the filled-in data of our analysis. The following diagram illustrates this idea:

Figure 11.2: Multiple imputation in a nutshell

So how does mice come up with the imputed values?

Let's focus on the univariate case—where only one column contains missing data and we use all the other (completed) columns to impute the missing values—before generalizing to the multivariate case.

mice actually has a few different imputation methods up its sleeve, each best suited for a particular use case. mice will often choose sensible defaults based on the data type (continuous, binary, non-binary categorical, and so on).

The most important method is what the package calls the norm method. This method is very much like stochastic regression. Each of the m imputations is created by adding a normal "noise" term to the output of a linear regression predicting the missing variable. What makes this slightly different than just stochastic regression repeated m times is that the norm method also integrates uncertainty about the regression coefficients used in the predictive linear model.

Recall that the regression coefficients in a linear regression are just estimates of the population coefficients from a random sample (that's why each regression coefficient has a standard error and confidence interval). Another sample from the population would have yielded slightly different coefficient estimates. If, through all our imputations, we just added a normal residual term from a linear regression equation with the same coefficients, we would be systematically understating our uncertainty regarding what the imputed values should be.

To combat this, in multiple imputation, each imputation of the data contains two steps.
The first step performs stochastic linear regression imputation using coefficients for each predictor estimated from the data. The second step chooses slightly different estimates of these regression coefficients, and proceeds into the next imputation. The first step of the next imputation uses the slightly different coefficient estimates to perform stochastic linear regression imputation again. After that, in the second step of the second iteration, still other coefficient estimates are generated to be used in the third imputation. This cycle goes on until we have m multiply imputed datasets.

How do we choose these different coefficient estimates at the second step of each imputation? Traditionally, the approach is Bayesian in nature; these new coefficients are drawn from each of the coefficients' posterior distributions, which describe credible values of the estimates using the observed data and uninformative priors. This is the approach that norm uses. There is an alternate method that chooses these new coefficient estimates from a sampling distribution that is created by taking repeated samples of the data (with replacement) and estimating the regression coefficients of each of these samples. mice calls this method norm.boot.

The multivariate case is a little more hairy, since the imputation for one column depends on the other columns, which may contain missing data of their own.

For this reason, we make several passes over all the columns that need imputing, until the imputation of all missing data in a particular column is informed by informed estimates of the missing data in the predictor columns. These passes over all the columns are called iterations.
So that you really understand how this iteration works, let's say we are performing multiple imputation on a subset of miss_mtcars containing only mpg, wt and drat. First, all the missing data in all the columns are set to a placeholder value like the mean or a randomly sampled non-missing value from its column. Then, we visit mpg where the placeholder values are turned back into missing values. These missing values are predicted using the two-part procedure described in the univariate case. Then we move on to wt; the placeholder values are turned back into missing values, whose new values are imputed with the two-step univariate procedure using mpg and drat as predictors. Then this is repeated with drat. This is one iteration. On the next iteration, it is not the placeholder values that get turned back into random values and imputed but the imputed values from the previous iteration. As this repeats, we shift away from the starting values and the imputed values begin to stabilize. This usually happens within just a few iterations. The dataset at the completion of the last iteration is the first multiply imputed dataset. Each m starts the iteration process all over again.

The default in mice is five iterations. Of course, you can increase this number if you have reason to believe that you need to. We'll discuss how to tell if this is necessary later in the section.

Methods of imputation

The method of imputation that we described for the univariate case, norm, works best for imputed values that follow an unconstrained normal distribution—but it could lead to some nonsensical imputations otherwise. For example, since the weights in wt are so close to 0 (because it's in units of a thousand pounds) it is possible for the norm method to impute a negative weight. Though this will no doubt balance out over the other m-1 multiply imputed datasets, we can combat this situation by using another method of imputation called predictive mean matching.

Predictive mean matching (mice calls this pmm) works a lot like norm. The difference is that the norm imputations are then used to find the d closest values to the imputed value among the non-missing data in the column. Then, one of these d values is chosen as the final imputed value—d=3 is the default in mice.
This method has a few great properties. For one, the possibility of imputing a negative value for wt is categorically off the table; the imputed value would have to be chosen from the set {1.513, 1.615, 1.835}, since these are the three lowest weights. More generally, any natural constraint in the data (lower or upper bounds, integer count data, numbers rounded to the nearest one-half, and so on) is respected with predictive mean matching, because the imputed values appear in the actual non-missing observed values. In this way, predictive mean matching is like hot-deck imputation. Predictive mean matching is the default imputation method in mice for numerical data, though it may be inferior to norm for small datasets and/or datasets with a lot of missing values.

Many of the other imputation methods in mice are specially suited for one particular data type. For example, binary categorical variables use logreg by default; this is like norm but uses logistic regression to impute a binary outcome. Similarly, non-binary categorical data uses multinomial regression—mice calls this method polyreg.

Multiple imputation in practice

There are a few steps to follow and decisions to make when using this powerful imputation technique:

Are the data MAR? And be honest! If the mechanism is likely not MAR, then more complicated measures have to be taken.
Are there any derived terms, redundant variables, or irrelevant variables in the dataset? Any of these types of variables will interfere with the regression process. Irrelevant variables—like unique IDs—will not have any predictive power. Derived terms or redundant variables—like having a column for weight in pounds and grams, or a column for area in addition to a length and width column—will similarly interfere with the regression step.
Convert all categorical variables to factors, otherwise mice will not be able to tell that the variable is supposed to be categorical.
Choose the number of iterations and m: By default, these are both five. Using five iterations is usually okay—and we'll be able to tell if we need more. Five imputations are usually okay, too, but we can achieve more statistical power from more imputed datasets. I suggest setting m to 20, unless the processing power and time can't be spared.
Choose an imputation method for each variable: You can stick with the defaults as long as you are aware of what they are and think they're the right fit.
Choose the predictors: Let mice use all the available columns as predictors as long as derived terms and redundant/irrelevant columns are removed. Not only does using more predictors result in reduced bias, but it also increases the likelihood that the data is MAR.
Perform the imputations
Audit the imputations
Perform analysis with the imputations
Pool the results of the analyses

Before we get down to it, let's call the mice function on our data frame with missing data, and use its default arguments, just to see what we shouldn't do and why:

# we are going to set the seed and printFlag to FALSE, but
# everything else will be the default argument
imp <- mice(miss_mtcars, seed=3, printFlag=FALSE)
print(imp)
------------------------------
Multiply imputed data set
Call:
mice(data = miss_mtcars, printFlag = FALSE, seed = 3)
Number of multiple imputations:  5
Missing cells per column:
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
   5    5    0    0    7    3    4    3    0    0    0
Imputation methods:
  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
"pmm" "pmm"    ""    "" "pmm" "pmm" "pmm" "pmm"    ""    ""    ""
VisitSequence:
 mpg  cyl drat   wt qsec   vs
   1    2    5    6    7    8
PredictorMatrix:
     mpg cyl disp hp drat wt qsec vs am gear carb
mpg    0   1    1  1    1  1    1  1  1    1    1
cyl    1   0    1  1    1  1    1  1  1    1    1
disp   0   0    0  0    0  0    0  0  0    0    0
...
Random generator seed value:  3

The first thing we notice (on line four of the output) is that mice chose to create five multiply imputed datasets, by default. As we discussed, this isn't a bad default, but more imputations can only improve our statistical power (if only marginally); when we impute this dataset in earnest, we will use m=20.

The second thing we notice (on lines 8-10 of the output) is that it used predictive mean matching as the imputation method for all the columns with missing data. If you recall, predictive mean matching is the default imputation method for numeric columns. However, vs and cyl are binary categorical and non-binary categorical variables, respectively. Because we didn't convert them to factors, mice thinks these are just regular numeric columns. We'll have to fix this.
The last thing we should notice here is the predictor matrix (starting on line 14). Each row and column of the predictor matrix refers to a column in the dataset to impute. If a cell contains a 1, it means that the variable referred to in the column is used as a predictor for the variable in the row. The first row indicates that all available attributes are used to help predict mpg, with the exception of mpg itself. All the values in the diagonal are 0, because mice won't use an attribute to predict itself. Note that the disp, hp, am, gear, and carb rows all contain 0s—this is because these variables are complete, and don't need to use any predictors.

Since we thought carefully about whether there were any attributes that should be removed before we perform the imputation, we can use mice's default predictor matrix for this dataset. If there were any non-predictive attributes (like unique identifiers, redundant variables, and so on), we would have either had to remove them (easiest option), or instruct mice not to use them as predictors (harder).

Let's now correct the issues that we've discussed.

  # convert categorical variables into factors
  miss_mtcars$vs  <- factor(miss_mtcars$vs)
  miss_mtcars$cyl <- factor(miss_mtcars$cyl)

  imp <- mice(miss_mtcars, m=20, seed=3, printFlag=FALSE)
  imp$method
  ------------------------------------
        mpg       cyl      disp        hp      drat
      "pmm" "polyreg"        ""        ""     "pmm"
         wt      qsec        vs        am      gear
      "pmm"     "pmm"  "logreg"        ""        ""
       carb
         ""

Now mice has corrected the imputation method of cyl and vs to their correct defaults. In truth, cyl is a kind of discrete numeric variable called an ordinal variable, which means that yet another imputation method may be optimal for that attribute, but, for the sake of simplicity, we'll treat it as a categorical variable.
Before we get to use the imputations in an analysis, we have to check the output. The first thing we need to check is the convergence of the iterations. Recall that for imputing data with missing values in multiple columns, multiple imputation requires iteration over all these columns a few times. At each iteration, mice produces imputations—and samples new parameter estimates from the parameters' posterior distributions—for all columns that need to be imputed. The final imputations, for each multiply imputed dataset m, are the imputed values from the final iteration.

In contrast to when we used MCMC in Chapter 7, Bayesian Methods, the convergence in mice is much faster; it usually occurs in just a few iterations. However, as in Chapter 7, Bayesian Methods, visually checking for convergence is highly recommended. We even check for it similarly; when we call the plot function on the variable that we assign the mice output to, it displays trace plots of the mean and standard deviation of all the variables involved in the imputations. Each line in each plot is one of the m imputations.

  plot(imp)

Figure 11.3: A subset of the trace plots produced by plotting an object returned by a mice imputation

As you can see from the preceding trace plot on imp, there are no clear trends and the variables are all overlapping from one iteration to the next. Put another way, the variance within a chain (there are m chains) should be about equal to the variance between the chains. This indicates that convergence was achieved.

If convergence was not achieved, you can increase the number of iterations that mice employs by explicitly specifying the maxit parameter to the mice function.

Note

To see an example of non-convergence, take a look at Figures 7 and 8 in the paper that describes this package, written by the authors of the package themselves. It is available at http://www.jstatsoft.org/article/view/v045i03.
The next step is to make sure the imputed values are reasonable. In general, whenever we quickly review the results of something to see if they make sense, it is called a sanity test or sanity check. With the following line, we're going to display the imputed values for the five missing mpgs for the first six imputations:

  imp$imp$mpg[,1:6]
  ------------------------------------
                        1    2    3    4    5    6
  Duster 360         19.2 16.4 17.3 15.5 15.0 19.2
  Cadillac Fleetwood 15.2 13.3 15.0 13.3 10.4 17.3
  Chrysler Imperial  10.4 15.0 15.0 16.4 10.4 10.4
  Porsche 914-2      27.3 22.8 21.4 22.8 21.4 15.5
  Ferrari Dino       19.2 21.4 19.2 15.2 18.1 19.2

These sure look reasonable. A better method for sanity checking is to call densityplot on the variable that we assign the mice output to:

  densityplot(imp)

Figure 11.4: Density plots of all the imputed values for mpg, drat, wt, and qsec. Each imputation has its own density curve in each quadrant

This displays, for every attribute imputed, a density plot of the actual non-missing values (the thick line) and the imputed values (the thin lines). We are looking to see that the distributions are similar. Note that the density curves of the imputed values extend much higher than the observed values' density curve in this case. This is partly because we imputed so few values that there weren't enough data points to properly smooth the density approximation. Height and non-smoothness notwithstanding, these density plots indicate no outlandish behavior among the imputed variables.

We are now ready for the analysis phase. We are going to perform linear regression on each imputed dataset and attempt to model mpg as a function of am, wt, and qsec. Instead of repeating the analyses on each dataset manually, we can apply an expression to all the datasets at one time with the with function, as follows:

  imp_models <- with(imp, lm(mpg ~ am + wt + qsec))

We could take a peek at the estimated coefficients from each dataset using lapply on the analyses attribute of the returned object:

  lapply(imp_models$analyses, coef)
  ---------------------------------
  [[1]]
  (Intercept)          am          wt        qsec
   18.1534095   2.0284014  -4.4054825   0.8637856

  [[2]]
  (Intercept)          am          wt        qsec
     8.375455    3.336896   -3.520882    1.219775

  [[3]]
  (Intercept)          am          wt        qsec
     5.254578    3.277198   -3.233096    1.337469
  ...
Finally,let’spooltheresultsoftheanalyses(withthepoolfunction),andcallsummaryon it: pooled_model<-pool(imp_models) summary(pooled_model) ---------------------------------estsetdfPr(>|t|) (Intercept)7.0497819.22545810.76416617.633190.454873254 am3.1820491.74454441.82400021.366000.082171407 wt-3.4135340.9983207-3.41927614.998160.003804876 qsec1.2707120.36601313.47176519.932960.002416595 lo95hi95nmisfmilambda (Intercept)-12.361128126.460690NA0.34591970.2757138 am-0.44214956.80624700.22903590.1600952 wt-5.5414268-1.28564130.43248280.3615349 qsec0.50705702.03436640.27360260.2042003 ThoughwecouldhaveperformedthepoolingourselvesusingtheequationsthatDonald Rubinoutlinedinhis1987classicMultipleImputationforNonresponseinSurveys,itis lessofahassleandlesserror-pronetohavethepoolfunctiondoitforus.Readerswhoare interestedinthepoolingrulesareencouragedtoconsulttheaforementionedtext. Asyoucansee,foreachparameter,poolhascombinedthecoefficientestimateand standarderrors,andcalculatedtheappropriatedegreesoffreedom.Theseallowustot-test eachcoefficientagainstthenullhypothesisthatthecoefficientisequalto0,producepvaluesforthet-test,andconstructconfidenceintervals. Thestandarderrorsandconfidenceintervalsarewiderthanthosethatwouldhaveresulted fromlinearregressiononasingleimputeddataset,butthat’sbecauseitisappropriately takingintoaccountouruncertaintyregardingwhatthemissingvalueswouldhavebeen. Thereare,atpresenttime,alimitednumberofanalysesthatcanbeautomaticallypooled bymice—themostimportantbeinglm/glm.Ifyourecall,though,thegeneralizedlinear modelisextremelyflexible,andcanbeusedtoexpressawidearrayofdifferentanalyses. Byextension,wecouldusemultipleimputationfornotonlylinearregressionbutlogistic regression,Poissonregression,t-tests,ANOVA,ANCOVA,andmore. 
Analysis with unsanitized data

Very often, there will be errors or mistakes in data that can severely complicate analyses—especially with public data or data from outside of your organization. For example, say there is a stray comma or punctuation mark in a column that was supposed to be numeric. If we aren't careful, R will read this column as character, and subsequent analysis may, in the best-case scenario, fail; it is also possible, however, that our analysis will silently chug along, and return an unexpected result. This will happen, for example, if we try to perform linear regression using the punctuation-containing-but-otherwise-numeric column as a predictor, which will compel R to convert it into a factor, thinking that it is a categorical variable.

In the worst-case scenario, an analysis with unsanitized data may not error out or return nonsensical results, but return results that look plausible but are actually incorrect. For example, it is common (for some reason) to encode missing data with 999 instead of NA; performing a regression analysis with 999s in a numeric column can severely adulterate our linear models, but often not enough to cause clearly inappropriate results. This mistake may then go undetected indefinitely.

Some problems like these could, rather easily, be detected in small datasets by visually auditing the data. Often, however, mistakes like these are notoriously easy to miss. Further, visual inspection is an untenable solution for datasets with thousands of rows and hundreds of columns. Any sustainable solution must off-load this auditing process to R. But how do we describe aberrant behavior to R so that it can catch mistakes on its own?

The package assertr seeks to do this by introducing a number of data-checking verbs. Using assertr grammar, these verbs (functions) can be combined with subjects (data) in different ways to express a rich vocabulary of data validation tasks.

More prosaically, assertr provides a suite of functions designed to verify assumptions about data early in the analysis process, before any time is wasted computing on bad data. The idea is to provide as much information as you can about how you expect the data to look up front so that any deviation from this expectation can be dealt with immediately.
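Before we turn to assertr, note that the 999 sentinel mentioned above is easy to neutralize once you know it is there. Here is a hypothetical recode (the tiny income vector is made up for illustration):

```r
# Hypothetical survey data in which 999 was used to encode "missing"
survey <- data.frame(income = c(52000, 999, 48000, 61000, 999))

mean(survey$income)                        # adulterated by the sentinels

survey$income[survey$income == 999] <- NA  # recode the sentinel to NA
mean(survey$income, na.rm = TRUE)          # now a sensible estimate
```

The hard part, of course, is knowing that the 999s are there in the first place; that is the problem assertr addresses.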
Given that the assertr grammar is designed to be able to describe a bouquet of error-checking routines, rather than list all the functions and functionalities that the package provides, it would be more helpful to visit particular use cases.

Two things before we start. First, make sure you install assertr. Second, bear in mind that all data verification verbs in assertr take a data frame to check as their first argument, and either (a) return the same data frame if the check passes, or (b) produce a fatal error. Since the verbs return a copy of the chosen data frame if the check passes, the main idiom in assertr involves reassignment of the returned data frame after it passes the check:

  a_dataset <- CHECKING_VERB(a_dataset, ....)

Checking for out-of-bounds data

It's common for numeric values in a column to have a natural constraint on the values that it should hold. For example, if a column represents a percent of something, we might want to check if all the values in that column are between 0 and 1 (or 0 and 100). In assertr, we typically use the within_bounds function in conjunction with the assert verb to ensure that this is the case. For example, if we added a column to mtcars that represented each car's weight as a percent of the heaviest car's weight:

  library(assertr)
  mtcars.copy <- mtcars

  mtcars.copy$Percent.Max.Wt <- round(mtcars.copy$wt /
                                      max(mtcars.copy$wt), 2)
  mtcars.copy <- assert(mtcars.copy, within_bounds(0,1),
                        Percent.Max.Wt)

within_bounds is actually a function that takes the lower and upper bounds and returns a predicate, a function that returns TRUE or FALSE. The assert function then applies this predicate to every element of the column specified in the third argument. If there are more than three arguments, assert will assume there are more columns to check.

Using within_bounds, we can also avoid the situation where NA values are specified as 999, as long as the upper bound given to within_bounds is less than this value.

within_bounds can take other information, such as whether the bounds should be inclusive or exclusive, or whether it should ignore NA values. To see the options for this, and for all the other functions in assertr, use the help function on them.
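Since within_bounds just returns a predicate, you can call the returned closure directly to see exactly what assert will apply element-wise. This is a small sketch assuming assertr is installed:

```r
library(assertr)

# within_bounds returns a predicate (a closure); assert maps this
# predicate over every element of the specified column(s)
pred <- within_bounds(0, 1)
pred(0.5)   # TRUE
pred(2)     # FALSE
```

Trying the predicate at the console like this is a handy way to confirm that a bound behaves the way you expect before wiring it into an assert call.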
Let’sseeanexampleofwhatitlookslikewhentheassertfunctionfails: mtcars.copy$Percent.Max.Wt[c(10,15)]<-2 mtcars.copy<-assert(mtcars.copy,within_bounds(0,1), Percent.Max.Wt) -----------------------------------------------------------Error: Vector'Percent.Max.Wt'violatesassertion'within_bounds'2times(e.g. [2]atindex10) Wegetaninformativeerrormessagethattellsushowmanytimestheassertionwas violated,andtheindexandvalueofthefirstoffendingdatum. Withassert,wehavetheoptionofcheckingaconditiononmultiplecolumnsatthesame time.Forexample,noneofthemeasurementsiniriscanpossiblybenegative.Here’s howwemightmakesureourdatasetiscompliant: iris<-assert(iris,within_bounds(0,Inf), Sepal.Length,Sepal.Width, Petal.Length,Petal.Width) #orsimply"-Species"becausethat #willincludeallcolumns*except*Species iris<-assert(iris,within_bounds(0,Inf), -Species) Onoccasion,wewillwanttocheckelementsforadherencetoamorecomplicatedpattern. Forexample,let’ssaywehadacolumnthatweknewwaseitherbetween-10and-20,or 10and20.Wecancheckforthisbyusingthemoreflexibleverifyverb,whichtakesa logicalexpressionasitssecondargument;ifanyoftheresultsinthelogicalexpressionis FALSE,verifywillcauseanerror. 
  vec <- runif(10, min=10, max=20)
  # randomly turn some elements negative
  vec <- vec * sample(c(1, -1), 10,
                      replace=TRUE)
  example <- data.frame(weird=vec)

  example <- verify(example, ((weird < 20 & weird > 10) |
                              (weird < -10 & weird > -20)))
  # or
  example <- verify(example, abs(weird) < 20 & abs(weird) > 10)
  # passes

  example$weird[4] <- 0
  example <- verify(example, abs(weird) < 20 & abs(weird) > 10)
  # fails
  ------------------------------------
  Error in verify(example, abs(weird) < 20 & abs(weird) > 10):
  verification failed! (1 failure)

Checking the data type of a column

By default, most of the data import functions in R will attempt to guess the data type for each column at the import phase. This is usually nice, because it saves us from tedious work. However, it can backfire when there are, for example, stray punctuation marks in what are supposed to be numeric columns. To verify this, we can use the assert function with the is.numeric base function:

  iris <- assert(iris, is.numeric, -Species)

We can use the is.character and is.logical functions with assert, too.

An alternative method that will disallow the import of unexpected data types is to specify the data type that each column should be at the data import phase, with the colClasses optional argument:

  iris <- read.csv("PATH_TO_IRIS_DATA.csv",
                   colClasses=c("numeric", "numeric",
                                "numeric", "numeric",
                                "character"))

This solution comes with the added benefit of speeding up the data import process, since R doesn't have to waste time guessing each column's data type.

Checking for unexpected categories

Another data integrity impropriety that is, unfortunately, very common is the mislabeling of categorical variables. There are two types of mislabeling of categories that can occur: an observation's class is mis-entered/mis-recorded/mistaken for that of another class, or the observation's class is labeled in a way that is not consistent with the rest of the labels. To see an example of what we can do to combat the former case, read assertr's vignette.
The latter case covers instances where, for example, the species of iris could be misspelled (such as "versicolour" or "verginica") or cases where the pattern established by the majority of class names is ignored ("iris setosa", "i.setosa", "SETOSA"). Either way, these misspecifications prove to be a great bane to data analysts for several reasons. For example, an analysis that is predicated upon a two-class categorical variable (for example, logistic regression) will now have to contend with more than two categories. Yet another way in which unexpected categories can haunt you is by producing statistics grouped by different values of a categorical variable; if the categories were extracted from the main data manually—with subset, for example, as opposed to with by, tapply, or aggregate—you'll be missing potentially crucial observations.

If you know what categories you are expecting from the start, you can use the in_set function, in concert with assert, to confirm that all the categories of a particular column are squarely contained within a predetermined set:

  # passes
  iris <- assert(iris, in_set("setosa", "versicolor",
                              "virginica"), Species)

  # mess up the data
  iris.copy <- iris
  # We have to make the 'Species' column not
  # a factor
  iris.copy$Species <- as.vector(iris$Species)
  iris.copy$Species[4:9] <- "SETOSA"
  iris.copy$Species[135] <- "verginica"
  iris.copy$Species[95] <- "i.versicolor"

  # fails
  iris.copy <- assert(iris.copy, in_set("setosa", "versicolor",
                                        "virginica"), Species)
  -------------------------------------------
  Error:
  Vector 'Species' violates assertion 'in_set' 8 times (e.g. [SETOSA] at index 4)

If you don't know the categories that you should be expecting, a priori, the following incantation, which will tell you how many rows each category contains, may help you identify the categories that are either rare or misspecified:

  by(iris.copy, iris.copy$Species, nrow)

Checking for outliers, entry errors, or unlikely data points

Automatic outlier detection (sometimes known as anomaly detection) is something that a lot of analysts scoff at and view as a pipe dream. Though the creation of a routine that automagically detects all erroneous data points with 100 percent specificity and precision
is impossible, unmistakably mis-entered data points and flagrant outliers are not hard to detect even with very simple methods. In my experience, there are a lot of errors of this type.

One simple way to detect the presence of a major outlier is to confirm that every data point is within some n number of standard deviations away from the mean of the group. assertr has a function, within_n_sds—used in conjunction with the insist verb—to do just this; if we wanted to check that every numeric value in iris is within five standard deviations of its respective column's mean, we could express so thusly:

  iris <- insist(iris, within_n_sds(5), -Species)

An issue with using standard deviations away from the mean (z-scores) for detecting outliers is that both the mean and standard deviation are influenced heavily by outliers; this means that the very thing we are trying to detect is obstructing our ability to find it.

There is a more robust measure of central tendency and dispersion than the mean and standard deviation: the median and median absolute deviation. The median absolute deviation is the median of the absolute values of a vector's elements after the vector's median has been subtracted from each of them.

assertr has a sister to within_n_sds, within_n_mads, that checks every element of a vector to make sure it is within n median absolute deviations away from its column's median:

  iris <- insist(iris, within_n_mads(4), -Species)

  iris$Petal.Length[5] <- 15
  iris <- insist(iris, within_n_mads(4), -Species)
  ---------------------------------------------
  Error:
  Vector 'Petal.Length' violates assertion 'within_n_mads' 1 time (value [15] at index 5)

In my experience, within_n_mads can be an effective guard against illegitimate univariate outliers if n is chosen carefully.

The examples here have been focusing on outlier identification in the univariate case—across one dimension at a time. Often, there are times where an observation is truly anomalous but it wouldn't be evident by looking at the spread of each dimension individually. assertr has support for this type of multivariate outlier analysis, but a full discussion of it would require a background outside the scope of this text.
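As an aside, the median absolute deviation that powers within_n_mads is easy to compute by hand. Note that R's built-in mad function scales the raw statistic by 1.4826 by default (for consistency with the standard deviation under normality), so we switch that off with constant = 1 to match the plain definition:

```r
x <- c(2, 3, 3, 4, 5, 100)             # 100 is an obvious outlier

raw_mad <- median(abs(x - median(x)))  # the definition, verbatim
raw_mad                                # 1 -- barely moved by the outlier
sd(x)                                  # ~39.5 -- wrecked by the outlier

mad(x, constant = 1)                   # the built-in agrees: 1
```

The comparison with sd illustrates exactly why the robust statistic is preferable for outlier detection: the outlier inflates the standard deviation by an order of magnitude while leaving the MAD essentially untouched.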
Chaining assertions

assertr aims to make the checking of assumptions so effortless that the user never feels the need to hold back any implicit assumption. Therefore, it's expected that the user uses multiple checks on one data frame.

The usage examples that we've seen so far are really only appropriate for one or two checks. For example, a usage pattern such as the following is clearly unworkable:

  iris <- CHECKING_CONSTRUCT4(CHECKING_CONSTRUCT3(CHECKING_CONSTRUCT2(CHECKING_CONSTRUCT1(this, ...), ...), ...), ...)

To combat this visual cacophony, assertr provides direct support for chaining multiple assertions by using the "piping" construct from the magrittr package.

The pipe operator of magrittr, %>%, works as follows: it takes the item on the left-hand side of the pipe and inserts it (by default) into the position of the first argument of the function on the right-hand side. The following are some examples of simple magrittr usage patterns:

  library(magrittr)

  4 %>% sqrt                 # 2
  iris %>% head(n=3)         # the first 3 rows of iris
  iris <- iris %>% assert(within_bounds(0, Inf), -Species)

Since the return value of a passed assertr check is the validated data frame, you can use the magrittr pipe operator to tack on more checks in a way that lends itself to easier human understanding. For example:

  iris <- iris %>%
    assert(is.numeric, -Species) %>%
    assert(within_bounds(0, Inf), -Species) %>%
    assert(in_set("setosa", "versicolor", "virginica"), Species) %>%
    insist(within_n_mads(4), -Species)

  # or, equivalently

  CHECKS <- . %>%
    assert(is.numeric, -Species) %>%
    assert(within_bounds(0, Inf), -Species) %>%
    assert(in_set("setosa", "versicolor", "virginica"), Species) %>%
    insist(within_n_mads(4), -Species)

  iris <- iris %>% CHECKS

When chaining assertions, I like to put the most integral and general ones right at the top. I also like to put the assertions most likely to be violated right at the top, so that execution is terminated before any more checks are run.

There are many other capabilities built into assertr, including multivariate outlier checking. For more information about these, read the package's vignette (vignette("assertr")).
On the magrittr side, besides the forward-pipe operator, this package sports some other very helpful pipe operators. Additionally, magrittr allows the substitution at the right side of the pipe operator to occur at locations other than the first argument. For more information about the wonderful magrittr package, read its vignette.

Other messiness

As we discussed in this chapter's preface, there are countless ways that a dataset may be messy. There are many other messy situations and solutions that we couldn't discuss at length here. In order that you, dear reader, are not left in the dark regarding custodial solutions, here are some other remedies which you may find helpful along your analytics journey:

OpenRefine

Though OpenRefine (formerly Google Refine) doesn't have anything to do with R per se, it is a sophisticated tool for working with and cleaning up messy data. Among its numerous, sophisticated capabilities is the capacity to auto-detect misspelled or misspecified categories and fix them at the click of a button.

Regular expressions

Suppose you find that there are commas separating every third digit of the numbers in a numeric column. How would you remove them? Or suppose you needed to strip a currency symbol from values in columns that hold monetary values so that you can compute with them as numbers. These, and vastly more complicated text transformations, can be performed using regular expressions (a formal grammar for specifying search patterns in text) and associated R functions like grep and sub. Any time spent learning regular expressions will pay enormous dividends over your career as an analyst, and there are many great, free tutorials available on the web for this purpose.
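For instance, both of the cleanups just described can be handled by a single gsub call; the prices vector here is made up for illustration:

```r
# Hypothetical character column with currency symbols and
# comma thousands-separators
prices <- c("$1,200", "$950", "$12,450.75")

# remove "$" and "," in one pass, then convert to numeric
clean <- as.numeric(gsub("[$,]", "", prices))
clean   # 1200.00 950.00 12450.75
```

The character class [$,] matches either offending character, so one substitution pass sanitizes the whole vector.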
tidyr

There are a few different ways in which you can represent the same tabular dataset. In one form—called long, narrow, stacked, or entity-attribute-value model—each row contains an observation ID, a variable name, and the value of that variable. For example:

             member  attribute  value
  1     Ringo Starr  birthyear   1940
  2  Paul McCartney  birthyear   1942
  3 George Harrison  birthyear   1943
  4     John Lennon  birthyear   1940
  5     Ringo Starr instrument  Drums
  6  Paul McCartney instrument   Bass
  7 George Harrison instrument Guitar
  8     John Lennon instrument Guitar

In another form (called wide or unstacked), each of the observation's variables is stored in its own column:

             member birthyear instrument
  1 George Harrison      1943     Guitar
  2     John Lennon      1940     Guitar
  3  Paul McCartney      1942       Bass
  4     Ringo Starr      1940      Drums

If you ever need to convert between these representations (which is a somewhat common operation, in practice), tidyr is your tool for the job.

Exercises

The following are a few exercises for you to strengthen your grasp over the concepts learned in this chapter:

- Normally, when there is missing data for a question such as "What is your income?", we strongly suspect an MNAR mechanism, because we live in a dystopia that equates wealth with worth. As a result, the participants with the lowest income may be embarrassed to answer that question. In the relevant section, we assumed that because the question was poorly worded and we could account for whether English was the first language of the participant, the mechanism is MAR. If we were wrong about this reason, and it was really because the lower-income participants were reticent to admit their income, what would the missing data mechanism be now? If, however, the differences in income were fully explained by whether English was the first language of the participant, what would the missing data mechanism be in that case?
- Find a dataset on the web with missing data. What does it use to denote that data is missing? Think about that dataset's missing data mechanism. Is there a chance that this data is MNAR?
- Find a freely available government dataset on the web. Read the dataset's description, and think about what assumptions you might make about the data when planning a certain analysis. Translate these into actual code so that R can check them for you.
Were there any deviations from your expectations?
- When two autonomous individuals decide to voluntarily trade, the transaction can be in both parties' best interests. Does it necessarily follow that a voluntary trade between nations benefits both states? Why or why not?

Summary

"Messy data"—no matter what definition you use—present a huge roadblock for people who work with data. This chapter focused on two of the most notorious and prolific culprits: missing data and data that has not been cleaned or audited for quality.

On the missing data side, you learned how to visualize missing data patterns, and how to recognize different types of missing data. You saw a few unprincipled ways of tackling the problem, and learned why they were suboptimal solutions. Multiple imputation, so you learned, addresses the shortcomings of these approaches and, through its usage of several imputed datasets, correctly communicates our uncertainty surrounding the imputed values.

On unsanitized data, we saw that the, perhaps, optimal solution (visually auditing the data) was untenable for moderately sized datasets or larger. We discovered that the grammar of the package assertr provides a mechanism to offload this auditing process to R. You now have a few assertr checking "recipes" under your belt for some of the more common manifestations of the mistakes that plague data that has not been scrutinized.

Chapter 12. Dealing with Large Data

In the previous chapter, we spoke of solutions to common problems that fall under the umbrella term of messy data. In this chapter, we are going to solve some of the problems related to working with large datasets.

Problems, in the case of working with large datasets, can occur in R for a few reasons. For one, R (and most other languages, for that matter) was developed during a time when commodity computers only had one processor/core. This means that vanilla R code can't exploit multiple processors/multiple cores, which can offer substantial speed-ups. Another salient reason why R might run into trouble analyzing large datasets is that R requires the data objects it works with to be stored completely in RAM. If your dataset exceeds the capacity of your RAM, your analyses will slow down to a crawl.
When one thinks of problems related to analyzing large datasets, they may think of Big Data. One can scarcely be involved (or even interested) in the field of data analysis without hearing about big data. I stay away from that term in this chapter for two reasons: (a) the problems and techniques in this chapter will still be applicable long after the buzzword begins to fade from public memory, and (b) problems related to truly big data are relatively uncommon, and often require specialized tools and know-how that are beyond the scope of this book.

Some have suggested that the definition of big data be data that is too big to fit in your computer's memory at one time. Personally, I call this large data—and not just because I have a penchant for splitting hairs! I reserve the term big data for data that is so massive that it requires many hundreds of computers and special consideration in order to be stored and processed.

Sometimes, problems related to high-dimensional data are considered large data problems, too. Unfortunately, solving these problems often requires a background and mathematics beyond the scope of this book, and we will not be discussing high-dimensional statistics.

This chapter is more about optimizing R code to squeeze higher performance out of it so that calculations and analyses with large datasets become computationally tractable. So, perhaps this chapter should more aptly be named High Performance R. Unfortunately, this title is more ostentatious, and wouldn't fit the naming pattern established by the previous chapter.

Each of the top-level sections in this chapter will discuss a specific technique for writing higher-performing R code.

Wait to optimize

Prominent computer scientist and mathematician Donald Knuth famously stated:

Premature optimization is the root of all evil.

I, personally, hold that money is the root of all evil, but premature optimization is definitely up there!

Why is premature optimization so evil? Well, there are a few reasons. First, programmers can sometimes be pretty bad at identifying what the bottleneck of a program—the routine(s) that have the slowest throughput—is, and optimize the wrong parts of a program. Identification of bottlenecks can most accurately be performed by profiling your code after it's been completed in an un-optimized form.
Secondly, clever tricks and shortcuts for speeding up code often introduce subtle bugs and unexpected behavior. Now, the speedup of the code—if there is any!—must be taken in context with the time it took to complete the bug-finding-and-fixing expedition; occasionally, a net negative amount of time has been saved when all is said and done.

Lastly, since premature optimization literally necessitates writing your code in a way that is different than you normally would, it can have deleterious effects on the readability of the code and your ability to understand it when you look back on it after some period of time. According to Structure and Interpretation of Computer Programs, one of the most famous textbooks in computer science, "Programs must be written for people to read, and only incidentally for machines to execute." This reflects the fact that the bulk of the time spent updating or expanding code that is already written is spent on a human having to read and understand the code—not on the computer executing it. When you prematurely optimize, you may be causing a huge reduction in readability in exchange for a marginal gain in execution time.

In summary, you should probably wait to optimize your code until you are done, and the performance is demonstrably inadequate.

Using a bigger and faster machine

Instead of rewriting critical sections of your code, consider running the code on a machine with a faster processor, more cores, more RAM, faster bus speeds, and/or reduced disk latency. This suggestion may seem like a glib cop-out, but it's not. Sure, using a bigger machine for your analytics sometimes means extra money, but your time, dear reader, is money too. If, over the course of your work, it takes you many hours to optimize your code adequately, buying or renting a better machine may actually prove to be the more cost-effective solution.

Going down this road needn't require that you purchase a high-powered machine outright; there are now virtual servers that you can rent online for finite periods of time at reasonable prices. Some of these virtual servers can be configured to have 2 terabytes of RAM and 40 virtual processors. If you are interested in learning more about this option, look at the offerings of DigitalOcean, Amazon Elastic Compute Cloud, or many other similar service providers.
Ask your employer or research advisor if this is a feasible option. If you are working for a non-profit with a limited budget, you may be able to work out a deal with a particularly charitable cloud computing service provider. Tell 'em that Tony sent you! But don't actually do that.

Be smart about your code

In many cases, the performance of R code can be greatly improved by simple restructuring of the code; this doesn't change the output of the program, just the way it is represented. Restructurings of this type are often referred to as code refactoring. The refactorings that really make a difference performance-wise usually have to do with either improved allocation of memory or vectorization.

Allocation of memory

Refer all the way back to Chapter 5, Using Data to Reason About the World. Remember when we created a mock population of women's heights in the US, and repeatedly took 10,000 samples of 40 from it to demonstrate the sampling distribution of the sample means? In a code comment, I mentioned in passing that the snippet numeric(10000) created an empty vector of 10,000 elements, but I never explained why we did that. Why didn't we just create a vector of one element, and continually tack each new sample mean onto the end of it, as follows?

  set.seed(1)
  all.us.women <- rnorm(10000, mean=65, sd=3.5)

  means.of.our.samples.bad <- c(1)
  # I'm increasing the number of
  # samples to 30,000 to prove a point
  for(i in 1:30000){
    a.sample <- sample(all.us.women, 40)
    means.of.our.samples.bad[i] <- mean(a.sample)
  }

It turns out that R stores vectors in contiguous addresses in your computer's memory. This means that every time a new sample mean gets tacked onto the end of means.of.our.samples.bad, R has to make sure that the next memory block is free. If it is not, R has to find a contiguous section of memory that can fit all the elements, copy the vector over (element by element), and free the memory in the original location. In contrast, when we created an empty vector of the appropriate number of elements, R only had to find a memory location with the requisite number of free contiguous addresses once.
Let’sseejustwhatkindofdifferencethismakesinpractice.Wewillusethesystem.time functiontotimetheexecutiontimeofboththeapproaches: means.of.our.samples.bad<-c(1) system.time( for(iin1:30000){ a.sample<-sample(all.us.women,40) means.of.our.samples.bad[i]<-mean(a.sample) } ) means.of.our.samples.good<-numeric(30000) system.time( for(iin1:30000){ a.sample<-sample(all.us.women,40) means.of.our.samples[i]<-mean(a.sample) } ) ------------------------------------usersystemelapsed 2.0240.4312.465 usersystemelapsed 0.6780.0040.684 Althoughanelapsedtimesavingoflessthanone/twosecondsdoesn’tseemlikeabig deal,(a)itaddsup,and(b)thedifferencegetsmoreandmoredramaticasthenumberof elementsinthevectorincrease. Bytheway,thispreallocationbusinessappliestomatrices,too. Vectorization WereyouwonderingwhyRissoadamantaboutkeepingtheelementsofvectorsin adjoiningmemorylocations?Well,ifRdidn’t,thentraversingavector(likewhenyou applyafunctiontoeachelement)wouldrequirehuntingaroundthememoryspaceforthe rightelementsindifferentlocations.Havingtheelementsallinarowgivesusan enormousadvantage,performance-wise. Tofullyexploitthisvectorrepresentation,ithelpstousevectorizedfunctions—whichwe werefirstintroducedtoinChapter1,RefresheR.Thesevectorizedfunctionscall optimized/blazingly-fastCcodetooperateonvectorsinsteadofonthecomparatively slowerRcode.Forexample,let’ssaywewantedtosquareeachheightinthe all.us.womenvector.Onewaywouldbetouseafor-looptosquareeachelementas follows: system.time( for(iin1:length(all.us.women)) all.us.women[i]^2 ) -------------------------usersystemelapsed 0.0030.0000.003 Okay,notbadatall.Nowwhatifweappliedalambdasquaringfunctiontoeachelement usingsapply? 
system.time(
  sapply(all.us.women, function(x) x^2)
)
----------------------
   user  system elapsed
  0.006   0.000   0.006

Okay, that's worse. But we can use a function that's like sapply and which allows us to specify the type of the return value in exchange for a faster processing speed:

> system.time(
+   vapply(all.us.women, function(x) x^2, numeric(1))
+ )
------------------------
   user  system elapsed
  0.006   0.000   0.005

Still not great. Finally, what if we just square the entire vector?

system.time(
  all.us.women^2
)
---------------------
   user  system elapsed
      0       0       0

This was so fast that system.time didn't have the resolution to detect any processing time at all. Further, this way of writing the squaring functionality was by far the easiest to read. The moral of the story is to use vectorized options whenever you can. All of core R's arithmetic operators (+, -, ^, sqrt, log, and so on) are of this type. Additionally, using the rowSums and colSums functions on matrices is faster than apply(A_MATRIX, 1, sum) and apply(A_MATRIX, 2, sum) respectively, for much the same reason.

Speaking of matrices, before we move on, you should know that certain matrix operations are blazingly fast in R, because the routines are implemented in compiled C and/or Fortran code. If you don't believe me, try writing and testing the performance of OLS regression without using matrix multiplication.

If you have the linear algebra know-how, and have the option to rewrite a computation that you need to perform using matrix operations, you should definitely try it out.

Using optimized packages

Many of the functionalities in base R have alternative implementations available in contributed packages. Quite often, these packages offer a faster or less memory-intensive substitute for the base R equivalent. For example, in addition to adding a ton of extra functionality, the glmnet package performs regression far faster than glm in my experience.

For faster data import, you might be able to use fread from the data.table package or the read_* family of functions from the readr package. It is not uncommon for data import tasks that used to take several hours to take only a few minutes with these read functions.
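If you'd like to see the rowSums claim for yourself, here is a quick sketch of a benchmark; the matrix dimensions are arbitrary, and the exact timings will vary from machine to machine:

```r
# A quick benchmark of the rowSums claim above;
# the matrix size is arbitrary
A_MATRIX <- matrix(rnorm(1e6), nrow = 1000)

system.time(apply(A_MATRIX, 1, sum))   # loops over rows in R
system.time(rowSums(A_MATRIX))         # calls optimized C code

# the two approaches agree on the answer
all.equal(apply(A_MATRIX, 1, sum), rowSums(A_MATRIX))
```

On my hardware, rowSums wins by roughly an order of magnitude, but the important part is that the results are identical, so the substitution is free.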
For common data manipulation tasks, like merging (joining), conditional selection, sorting, and so on, you will find that the data.table and dplyr packages offer incredible speed improvements. Both of these packages have a ton of useRs that swear by them, and the community support is solid. You'd be well advised to become proficient in one of these packages when you're ready.

Note

As it turns out, the sqldf package that I mentioned in passing in Chapter 10, Sources of Data (the one that can perform SQL queries on data frames) can sometimes offer performance improvements for common data manipulation tasks, too. Behind the scenes, sqldf (by default) loads your data frame into a temporary SQLite database, performs the query in the database's SQL execution environment, returns the results from the database in the form of a data frame, and destroys the temporary database. Since the queries run on the database, sqldf can (a) sometimes perform the queries faster than the equivalent native R code, and (b) somewhat relax the constraint that the data objects R uses be held completely in memory.

The constraint that the data objects in R must be able to fit into memory can be a real obstacle for people who work with datasets that are rather large, but just shy of being big enough to necessitate special tools. Some can thwart this constraint by storing their data objects in a database, and only using selected subsets (that will fit in the memory). Others can get by using random samples of the available data instead of requiring the whole dataset to be held at once. If none of these options sound appealing, there are packages in R that will allow importing data that is larger than the available memory by directly referring to the data as it's stored on your hard disk. The most popular of these seem to be ff and bigmemory. There is a cost to this, however; not only are the operations slower than they would be if they were in memory, but since the data is processed piecemeal, in chunks, many standard R functions won't work on them. Be that as it may, the ffbase and biganalytics packages provide methods to restore some of the lost functionality for the two packages respectively. Most notably, these packages allow ff and bigmemory objects to be used in the biglm package, which can build generalized linear models using
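To make the sqldf workflow concrete, here is a minimal sketch; it assumes the sqldf package is installed, and the data frame and column names are made up for illustration:

```r
# A minimal sqldf sketch (assumes the sqldf package is installed);
# the data frame and its columns are invented for illustration
library(sqldf)

df <- data.frame(airport = c("JFK", "LAX", "OLM", "ALB"),
                 delay   = c(12, 45, 3, 20))

# conditional selection and sorting, expressed as SQL;
# behind the scenes this runs in a temporary SQLite database
res <- sqldf("SELECT airport, delay FROM df
              WHERE delay > 10 ORDER BY delay")
res
```

The query itself is plain SQLite SQL, so anything you know from Chapter 10 carries over directly.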
data that is too big to fit in the memory.

Note

biglm can also be used to build generalized linear models using data stored in a database!

Remember the CRAN Task Views we talked about in the last chapter? There is a whole Task View dedicated to High-Performance Computing (https://cran.r-project.org/web/views/HighPerformanceComputing.html). If there is a particular statistical technique that you'd like to find an optimized alternative for, this is the first place I'd check.

Using another R implementation

R is both a language and an implementation of that language. So far, when we've been talking about the R environment/platform, we've been talking about the GNU project started by R. Ihaka and R. Gentleman at the University of Auckland in 1993 and hosted at http://www.r-project.org. Since R has no standard specification, this canonical implementation serves as R's de facto specification. If a project is able to implement this specification (and rewrite the GNU-R functionality-for-functionality and bug-for-bug), any valid R code can be run on that implementation.

Sometime around 2009, various other implementations of R started to crop up. Among these are Renjin (running on the Java Virtual Machine), pqR (which stands for Pretty Quick R, and is written in a mix of C, R, and Fortran), FastR (which is written in Java), and Riposte (which is written mainly in C++). These alternative implementations promise compelling improvements over GNU-R, such as automatic multithreading (parallelization), the ability to handle larger data, and tighter integration with Java.

Unfortunately, none of these projects is complete as yet. Because of this, not everything you'd expect has been implemented; some of your favorite packages may stop working, and, by and large, these implementations are difficult to install. For these reasons, I would only recommend this for very advanced users and/or for the extremely desperate.
Although it doesn't qualify as another R implementation, there is another R distribution that is gaining popularity, put out by a commercial enterprise named Revolution Analytics, called Revolution R Enterprise. This distribution boasts automatic parallelization for certain rewritten functions, improved ability to work on and model datasets that will not fit in RAM (for certain rewritten functions), facilities for distributed computing, and tighter integration with big data databases. This is a paid distribution of R, but you can use it for free if you are a student, or for a discount if you work in the non-profit/public service sector.

Revolution Analytics also puts out a free alternative distribution of R called Revolution R Open. The primary benefit of this distribution, from a performance perspective, is the ease with which it can be installed and used with the high-performance Intel Math Kernel Library (MKL). The MKL is a drop-in substitute for the linear algebra libraries that are bundled automatically with GNU-R. While the linear algebra library that ships with GNU-R is single-threaded, the MKL can exploit multiple cores transparently. This makes computations like matrix decomposition, matrix inversion, and vectorized math (very common whether explicitly used or not) much faster.

Before we go on, it should be noted that you don't have to use Revolution R Open to take advantage of the MKL or any other multi-threaded linear algebra libraries like OpenBLAS, ATLAS, and Accelerate (which comes with OS X and is Mac only); I don't. However, linking GNU-R with these other libraries can sometimes get messy and requires care. Interested readers can find instructions on how to do this linking on the web, mostly in the form of blog posts from R enthusiasts.

Note

The Macintosh version of Revolution R Open, by default, integrates with the multi-threaded Accelerate framework, instead of the MKL.

Use parallelization

As we saw in this chapter's introduction, one of the limitations of R (and most other programming languages) is that it was created before commodity personal computers had more than one processor or core. As a result, by default, R runs in only one process and, thus, makes use of one processor/core at a time.
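If you want a rough sense of whether a faster linear algebra library would help you, time the kinds of operations a multi-threaded BLAS accelerates. This is just a sketch; the matrix size is arbitrary, and the timings will vary wildly depending on which BLAS your R is linked against:

```r
# Operations that a multi-threaded BLAS (MKL, OpenBLAS, Accelerate)
# speeds up: matrix multiplication and inversion on a largish matrix
n <- 1000
m <- matrix(rnorm(n * n), nrow = n)

system.time(m %*% m)     # matrix multiplication
system.time(solve(m))    # matrix inversion
```

Run this before and after switching linear algebra libraries; if the elapsed times barely change, your workload probably isn't linear-algebra-bound and the linking hassle may not be worth it.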
If you have more than one core on your CPU, it means that when you leave your computer alone for a few hours during a long-running computation, your R task is running on one core while the others are idle. Clearly, this is not ideal; if your R task took advantage of all the available processing power, you could get massive speed improvements.

Parallel computation (of the type we'll be using) works by starting multiple processes at the same time. The operating system then assigns each of these processes to a particular CPU. When multiple processes run at the same time, the time to completion is only as long as the longest process, as opposed to the time to complete all the processes added together.

Figure 12.1: Diagram of parallelization and the resultant reduced time to completion

For example, let's say we have four processes in a task that each take 1 second to complete. Without using parallelization, the task would take 4 seconds, but with parallelization on four cores, the task would take 1 second.

Note

A word of warning: This is the ideal scenario; in practice, the cost of starting multiple processes constitutes an overhead that will result in the time to completion not scaling linearly with the number of cores used.

All this sounds great, but there's an important catch; each process has to be able to run independently of the output of the other processes. For example, if we wrote an R program to compute the nth number in the Fibonacci sequence, we couldn't divide that task up into smaller processes to run in parallel, because the nth Fibonacci number depends on what we compute as the (n-1)th Fibonacci number (and so on, ad infinitum). The parallelization of the type we'll be using in this chapter only works on problems that can be split up into processes, such that the processes don't depend on each other and there's no communication between processes. Luckily, there are many problems like this in data analysis! Almost as luckily, R makes it easy to use parallelization on problems of this type!
Problems of the nature that we just described are sometimes known as embarrassingly parallel problems, because the entire task can be broken down into independent components very easily. As an example, summing the numbers in a numeric vector of 100 elements is an embarrassingly parallel problem, because we can easily sum the first 50 elements in one process and the last 50 in another, in parallel, and just add the two numbers at the end to get the final sum. The pattern of computation we just described is sometimes referred to as split-apply-combine, divide and conquer, or map/reduce.

Note

Using parallelization to tackle the problem of summing 100 numbers is silly, since the overhead of the splitting and combining will take longer than it would to just sum up all the 100 elements serially. Also, sum is already really fast and vectorized.

Getting started with parallel R

Getting started with parallelization in R requires minimal setup, but that setup varies from platform to platform. More accurately, the setup is different for Windows than it is for every other operating system that R runs on (GNU/Linux, Mac OS X, Solaris, *BSD, and others).

If you don't have a Windows computer, all you have to do to start is to load the parallel package:

# You don't have to install this if your copy of R is new
library(parallel)

If you use Windows, you can either (a) switch to the free operating system that over 97 percent of the 500 most powerful supercomputers in the world use, or (b) run the following setup code:

library(parallel)
cl <- makeCluster(4)

You may replace the 4 with however many processes you want to automatically split your task into. This is usually set to the number of cores available on your computer. You can query your system for the number of available cores with the following incantation:

detectCores()
-----------------------
[1] 4

Our first silly (but demonstrative) application of parallelization is the task of sleeping (making a program become temporarily inactive) for 5 seconds, four different times. We can do this serially (not in parallel) as follows:

for(i in 1:4){
  Sys.sleep(5)
}

Or, equivalently, using lapply:

# lapply will pass each element of the
# vector c(1, 2, 3, 4) to the function
# we write, but we'll ignore it
lapply(1:4, function(i) Sys.sleep(5))
Let’stimehowlongthistasktakestocompletebywrappingthetaskinsidetheargument tothesystem.timefunction: system.time( lapply(1:4,function(i)Sys.sleep(5)) ) ---------------------------------------usersystemelapsed 0.0590.07420.005 Unsurprisingly,ittook20(4*5)secondstorun.Let’sseewhathappenswhenwerunthis inparallel: ####################### #NON-WINDOWSVERSION# ####################### system.time( mclapply(1:4,function(i)Sys.sleep(5),mc.cores=4) ) ################### #WINDOWSVERSION# ################### system.time( parLapply(cl,1:4,function(i)Sys.sleep(5)) ) ---------------------------------------usersystemelapsed 0.0210.0425.013 Checkthatout!5seconds!Justwhatyouwouldexpectiffourprocessesweresleepingfor 5secondsatthesametime! Forthenon-windowscode,wesimplyusethemclapply(thenon-Windowsparallel counterparttolapply)insteadoflapply,andpassinanotherargumentnamedmc.cores, whichtellsmclapplyhowmanyprocessestoautomaticallysplittheindependent computationinto. Forthewindowscode,weuseparLapply(theWindowsparallelcounterparttolapply). TheonlydifferencebetweenlapplyandparLapplythatwe’veusedhereisthat parLapplytakestheclusterwemadewiththemakeClustersetupfunctionasitsfirst argument.Unlikemclapply,there’snoneedtospecifythenumberofcorestouse,since theclusterisalreadysetuptotheappropriatenumberofcores. Note BeforeRgotthebuilt-inparallelpackage,thetwomainpackagesthatallowedfor parallelizationweremulticoreandsnow.multicoreusedamethodofcreatingdifferent processescalledforkingthatwassupportedonallR-runningOSsexceptWindows. Windowsusersusedthemoregeneralsnowpackagetoachieveparallelization.snow, whichstandsforSimpleNetworkofWorkstations,notonlyworksonnon-Windows computersaswellbutalsoonaclusterofdifferentcomputerswithidenticalR installations.multicoredidnotsupportclustercomputingacrossphysicalmachineslike snowdoes. SinceRversion2.14,thefunctionalityofboththemulticoreandsnowpackageshave essentiallybeenmergedintotheparallelpackage.Themulticorepackagehassince beenremovedfromCRAN. 
From now on, when we refer to the Windows counterpart to X, know that we really mean the snow counterpart to X, because the functions of snow will work on non-Windows OSs and clusters of machines. Similarly, by the non-Windows counterparts, we really mean the counterparts cannibalized from the multicore package.

You might ask, why don't we just always use the snow functions? If you have the option to use the multicore/forking parallelism (you are running processes on just one non-Windows physical machine), the multicore parallelism tends to be more lightweight. For example, sometimes the creation of a snow cluster with makeCluster can set off firewall alerts. It is safe to allow these connections, by the way.

An example of (some) substance

For our first real application of parallelization, we will be solving a problem that is loosely based on a real problem that I had to solve during the course of my work. In this formulation, we will be importing an open dataset from the web that contains the airport code, latitude coordinates, and longitude coordinates for 13,429 US airports. Our task will be to find the average (mean) distance from every airport to every other airport. For example, if LAX, ALB, OLM, and JFK were the only extant airports, we would calculate the distances between JFK and OLM, JFK and ALB, JFK and LAX, OLM and ALB, OLM and LAX, and ALB and LAX, and take the arithmetic mean of these distances.

Why are we doing this? Besides the fact that it was inspired by an actual, real-life problem, and that I covered this very problem in no fewer than three blog posts, this problem is perfect for parallelization for two reasons:

It is embarrassingly parallel: This problem is very amenable to splitting-applying-and-combining (or map/reduction); each process can take a few (several hundreds, really) of the airport-to-airport combinations, and the results can then be summed and divided by the number of distance calculations performed.
It exhibits combinatorial explosion: The term combinatorial explosion refers to problems that grow very quickly in size or complexity due to the role of combinatorics in the problem's solution. For example, the number of distance calculations we have to perform exhibits polynomial growth as a function of the number of airports we use. In particular, the number of different calculations is given by the binomial coefficient "n choose 2", or n(n-1)/2. 100 airports require 4,950 distance calculations; all 13,429 airports require 90,162,306 distance calculations. Problems of this type usually require techniques like those discussed in this chapter in order to be computationally tractable.

Note

The birthday problem: Most people are unfazed by the fact that it takes a room of 367 to guarantee that two people in the room have the same birthday. Many people are surprised, however, when it is revealed that it only requires a room full of 23 people for there to be a 50 percent chance of two people sharing the same birthday (assuming that birthdays occur on each day with equal probability). Further, it only takes a room full of 60 for there to be over a 99 percent chance that a pair will share a birthday. If this surprises you too, consider that the number of pairs of people that could possibly share their birthday grows polynomially with the number of people in the room. In fact, the number of pairs that can share a birthday grows just like our airport problem: the number of birthday pairs is exactly the number of distance calculations we would have to perform if the people were airports.

First, let's write the function to compute the distance between two latitude/longitude pairs. Since the Earth isn't flat (strictly speaking, it's not even a perfect sphere), the distance between the longitude and latitude degrees is not constant; meaning, you can't just take the Euclidean distance between the two points. We will be using the Haversine formula for the distance between the two points. The Haversine formula is given as follows:

a = sin²(Δφ/2) + cos(φ1) · cos(φ2) · sin²(Δλ/2)
d = 2r · atan2(√a, √(1-a))

where φ and λ are the latitude and longitude respectively, r is the Earth's radius, and Δ is the difference between the two latitudes or longitudes.
haversine <- function(lat1, long1, lat2, long2, unit="km"){
  radius <- 6378      # radius of Earth in kilometers
  delta.phi <- to.radians(lat2 - lat1)
  delta.lambda <- to.radians(long2 - long1)
  phi1 <- to.radians(lat1)
  phi2 <- to.radians(lat2)
  term1 <- sin(delta.phi/2) ^ 2
  term2 <- cos(phi1) * cos(phi2) * sin(delta.lambda/2) ^ 2
  the.terms <- term1 + term2
  delta.sigma <- 2 * atan2(sqrt(the.terms), sqrt(1-the.terms))
  distance <- radius * delta.sigma
  if(unit=="km") return(distance)
  if(unit=="miles") return(0.621371*distance)
}

Everything must be measured in radians (not degrees), so let's make a helper function for conversion to radians, too:

to.radians <- function(degrees){
  degrees * pi / 180
}

Now let's load the dataset from the web. Since it's from an outside source and it might be messy, this is an excellent chance to use our assertr chops to make sure the foreign dataset matches our expectations: the dataset is 13,429 observations long, it has three named columns, the latitude should be 90 or below, and the longitude should be 180 or below. We'll also just start with a subset of all the airports. Because we are going to be taking a random sample of all the observations, we'll set the random number generator seed so that my calculations will align with yours, dear reader.

set.seed(1)
the.url <- "http://opendata.socrata.com/api/views/rxrh-4cxm/rows.csv?
accessType=DOWNLOAD"
all.airport.locs <- read.csv(the.url, stringsAsFactors=FALSE)

library(magrittr)
library(assertr)

CHECKS <- . %>%
  verify(nrow(.) == 13429) %>%
  verify(names(.) %in% c("locationID", "Latitude", "Longitude")) %>%
  assert(within_bounds(0, 90), Latitude) %>%
  assert(within_bounds(0, 180), Longitude)

all.airport.locs <- CHECKS(all.airport.locs)

# Let's start off with 400 airports
smp.size <- 400
# choose a random sample of airports
random.sample <- sample((1:nrow(all.airport.locs)), smp.size)
airport.locs <- all.airport.locs[random.sample, ]
row.names(airport.locs) <- NULL

head(airport.locs)
-------------------------------------
  locationID Latitude Longitude
1        LWV  38.7642   87.6056
2       LS77  30.7272   91.1486
3        2N2  43.5919   71.7514
4       VG00  37.3697   75.9469

Now let's write a function called single.core that computes the average distance between every pair of airports, not using any parallel computation. For each lat/long pair, we need to find the distance between it and the rest of the lat/long pairs. Since the distance between points a and b is the same as the distance between b and a, for every row, we need only compute the distance between it and the remaining rows in the airport.locs data frame:

single.core <- function(airport.locs){
  running.sum <- 0
  for(i in 1:(nrow(airport.locs)-1)){
    for(j in (i+1):nrow(airport.locs)){
      # i is the row of the first lat/long pair
      # j is the row of the second lat/long pair
      this.dist <- haversine(airport.locs[i, 2],
                             airport.locs[i, 3],
                             airport.locs[j, 2],
                             airport.locs[j, 3])
      running.sum <- running.sum + this.dist
    }
  }
  # Now we have to divide by the number of
  # distances we took. This is given by n(n-1)/2
  return(running.sum /
         ((nrow(airport.locs)*(nrow(airport.locs)-1))/2))
}

Now, let's time it!

system.time(ave.dist <- single.core(airport.locs))
print(ave.dist)
----------------------------
   user  system elapsed
  5.400   0.034   5.466
[1] 1667.186

All right, 5 and a half seconds for 400 airports.

In order to use the parallel surrogates for lapply, let's rewrite the function to use lapply.
Observe the output of the following incantation:

# We'll have to limit the output to the
# first 11 columns
combn(1:10, 2)[, 1:11]
----------------------------------------
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]    1    1    1    1    1    1    1    1    1
[2,]    2    3    4    5    6    7    8    9   10
     [,10] [,11]
[1,]     2     2
[2,]     3     4

The preceding code used the combn function to create a matrix that contains all pairs of two numbers from 1 to 10, stored as columns in two rows. If we use the combn function with a vector of integers from 1 to n (where n is the number of airports in our data frame), each column of the resultant matrix will refer to all the different indices with which to index the airport data frame in order to obtain all the possible pairs of airports. For example, let's go back to the world where LAX, ALB, OLM, and JFK were the only extant airports; consider the following:

small.world <- c("LAX", "ALB", "OLM", "JFK")
all.combs <- combn(1:length(small.world), 2)
for(i in 1:ncol(all.combs)){
  from <- small.world[all.combs[1, i]]
  to <- small.world[all.combs[2, i]]
  print(paste(from, "<->", to))
}
----------------------------------------
[1] "LAX <-> ALB"
[1] "LAX <-> OLM"
[1] "LAX <-> JFK"
[1] "ALB <-> OLM"    # back to Olympia
[1] "ALB <-> JFK"
[1] "OLM <-> JFK"

Formulating our solution around this matrix of indices, we can use lapply to loop over the columns in the matrix:

small.world <- c("LAX", "ALB", "OLM", "JFK")
all.combs <- combn(1:length(small.world), 2)
# instead of printing each airport pair in a string,
# we'll return the string
results <- lapply(1:ncol(all.combs), function(x){
  from <- small.world[all.combs[1, x]]
  to <- small.world[all.combs[2, x]]
  return(paste(from, "<->", to))
})
print(results)
------------------------
[[1]]
[1] "LAX <-> ALB"

[[2]]
[1] "LAX <-> OLM"

[[3]]
[1] "LAX <-> JFK"
........

In our problem, we will be returning numerics from the anonymous function in lapply. However, because we are using lapply, the result will be a list. Because we can't call sum on a list of numerics, we will use the unlist function to turn the list into a vector.

unlist(results)
---------------------
[1] "LAX <-> ALB" "LAX <-> OLM" "LAX <-> JFK"
[4] "ALB <-> OLM" "ALB <-> JFK" "OLM <-> JFK"

We have everything we need to rewrite the single.core function using lapply.
single.core.lapply <- function(airport.locs){
  all.combs <- combn(1:nrow(airport.locs), 2)
  numcombs <- ncol(all.combs)
  results <- lapply(1:numcombs, function(x){
    lat1 <- airport.locs[all.combs[1, x], 2]
    long1 <- airport.locs[all.combs[1, x], 3]
    lat2 <- airport.locs[all.combs[2, x], 2]
    long2 <- airport.locs[all.combs[2, x], 3]
    return(haversine(lat1, long1, lat2, long2))
  })
  return(sum(unlist(results)) / numcombs)
}

system.time(ave.dist <- single.core.lapply(airport.locs))
print(ave.dist)
---------------------------------------
   user  system elapsed
  5.890   0.042   5.968
[1] 1667.186

This particular solution is a little bit slower than our solution with the double for loops, but it's about to pay enormous dividends; now we can use one of the parallel surrogates for lapply to solve the problem:

#######################
# NON-WINDOWS VERSION #
#######################
multi.core <- function(airport.locs){
  all.combs <- combn(1:nrow(airport.locs), 2)
  numcombs <- ncol(all.combs)
  results <- mclapply(1:numcombs, function(x){
    lat1 <- airport.locs[all.combs[1, x], 2]
    long1 <- airport.locs[all.combs[1, x], 3]
    lat2 <- airport.locs[all.combs[2, x], 2]
    long2 <- airport.locs[all.combs[2, x], 3]
    return(haversine(lat1, long1, lat2, long2))
  }, mc.cores=4)
  return(sum(unlist(results)) / numcombs)
}

###################
# WINDOWS VERSION #
###################
clusterExport(cl, c("haversine", "to.radians"))

multi.core <- function(airport.locs){
  all.combs <- combn(1:nrow(airport.locs), 2)
  numcombs <- ncol(all.combs)
  results <- parLapply(cl, 1:numcombs, function(x){
    lat1 <- airport.locs[all.combs[1, x], 2]
    long1 <- airport.locs[all.combs[1, x], 3]
    lat2 <- airport.locs[all.combs[2, x], 2]
    long2 <- airport.locs[all.combs[2, x], 3]
    return(haversine(lat1, long1, lat2, long2))
  })
  return(sum(unlist(results)) / numcombs)
}

system.time(ave.dist <- multi.core(airport.locs))
print(ave.dist)
------------------------------
   user  system elapsed
  7.363   0.240   2.743
[1] 1667.186

Before we interpret the output, direct your attention to the first line of the Windows segment. When mclapply creates additional processes, these processes share the memory
with the parent process, and have access to all of the parent's environment. With parLapply, however, the procedure that spawns new processes is a little different and requires that we manually export all the functions and libraries we need loaded onto each new process beforehand. In this example, we need the new workers to have the haversine and to.radians functions.

Now to the output of the last code snippet. On my Macintosh machine with four cores, this brings what once was a 5.5-second affair down to a 2.7-second affair. This may not seem like a big deal, but when we expand and start to include more than just 400 airports, we start to see the multicore version really pay off.

To demonstrate just what we've gained from our hassles in parallelizing the problem, I ran this on a GNU/Linux cloud server with 16 cores, and recorded the time it took to complete the calculations for different sample sizes with 1, 2, 4, 8, and 16 cores. The results are depicted in the following image:

Figure 12.2: The running times for the average-distance-between-all-airports task at different sample sizes for 1, 2, 4, 8, and 16 cores. For reference, the dashed line is the 4-core performance curve, the topmost curve is the single-core performance curve, and the bottommost curve is the 16-core curve.

It may be hard to tell from the plot, but the estimated times to completion for the task running on 1, 2, 4, 8, and 16 cores are 2.4 hours, 1.2 hours, 36 minutes, 19 minutes, and 17 minutes respectively. Using parallelized R on a 4-core machine (which is not an uncommon setup at the time of writing) has been able to shave a full two hours off the task's running time! Note the diminishing marginal returns on the number of cores used; there is barely any difference between the performance of 8 and 16 cores. C'est la vie.

Using Rcpp

Contrary to what I sometimes like to believe, there are other computer programming languages than just R. R, and languages like Python, Perl, and Ruby, are considered high-level languages, because they offer a greater level of abstraction from computer representations and resource management than the lower-level languages. For example, in some lower-level languages, you must specify the data type of the variables you create and manage the allocation of RAM manually; C, C++, and Fortran are of this type.
The high level of abstraction R provides allows us to do amazing things very quickly, like import a dataset, run a linear model, and plot the data and regression line in no more than 4 lines of code! On the other hand, nothing quite beats the performance of carefully crafted lower-level code. Even so, it would take hundreds of lines of code to run a linear model in a low-level language, so a language like that is inappropriate for agile analytics. One solution is to use R abstractions when we can, and be able to get down to lower-level programming where it can really make a large difference. There are a few paths for connecting R and lower-level languages, but the easiest way by far is to combine R and C++ with Rcpp.

Note

There are differences in what is considered high-level. For this reason, you will sometimes see people and texts (mostly older texts) refer to C and C++ as high-level languages. The same people may consider R, Python, and so on as very high-level languages. Therefore, the level of a language is somewhat relative.

A word of warning before we go on: This is an advanced topic, and this section will (out of necessity) gloss over some (most) of the finer details of C++ and Rcpp. If you're wondering whether a detailed reading will pay off, it's worth taking a peek at the conclusion of this section to see how many seconds it took to complete the average-distance-between-all-airports task that would have taken over 2 hours to complete unoptimized.

If you decide to continue, you must install a C++ compiler. On GNU/Linux, this is usually done through the system's package manager. On Mac OS X, Xcode must be installed; it is available for free in the App Store. For Windows, you must install the Rtools available at http://cran.r-project.org/bin/windows/Rtools/. Finally, all users need to install the Rcpp package. For more information, consult sections 1.2 and 1.3 of the Rcpp FAQ (http://dirk.eddelbuettel.com/code/rcpp/Rcpp-FAQ.pdf).
Essentially, our integration of R and C++ is going to take the form of us rewriting certain functions in C++, and calling them in R. Rcpp makes this very easy; before we discuss how to write C++ code, let's look at an example. Put the following code into a file, and name it our_cpp_functions.cpp:

#include <Rcpp.h>

// [[Rcpp::export]]
double square(double number){
  return(pow(number, 2));
}

Congratulations, you've just written a C++ program! Now, from R, we'll read the C++ file, and make the function available to R. Then, we'll test out our new function.

library(Rcpp)
sourceCpp("our_cpp_functions.cpp")
square(3)
--------------------------------
[1] 9

The first two lines with text have nothing to do with our function, per se. The first line is necessary for C++ to integrate with R. The second line (// [[Rcpp::export]]) tells R that we want the function directly below it to be available for use (exported) within R. Functions that aren't exported can only be used in the C++ file, internally.

Note

The // is a comment in C++, and it works just like # in R. C++ also has another type of comment that can span multiple lines. These multiline comments start with /* and end with */.

Throughout this section, we'll be adding functions to our_cpp_functions.cpp and re-sourcing the file from R to use the new C++ functions.

The modest square function can teach us a lot about the differences between C++ code and R code. For example, the preceding C++ function is roughly equivalent to the following in R:

square <- function(number){
  return(number^2)
}

The two doubles denote that the return value and the argument, respectively, are both of data type double. double stands for double-precision floating point number, which is roughly equivalent to R's more general numeric data type.

The second thing to notice is that we raise numbers to powers using the pow function, instead of using the ^ operator, like in R. This is a minor syntactical difference. The third thing to note is that each statement in C++ ends with a semicolon.

Believe it or not, we now have enough knowledge to rewrite the to.radians function in C++.
/* Add this (and all other snippets that
   start with "// [[Rcpp::export]]")
   to the C++ file, not the R code. */

// [[Rcpp::export]]
double to_radians_cpp(double degrees){
  return(degrees * 3.141593 / 180);
}

# this goes with our R code
sourceCpp("our_cpp_functions.cpp")
to_radians_cpp(10)
------------------------
[1] 0.174533

Incredibly, with the help of some search-engine-fu or a good C++ reference, we can rewrite the whole haversine function in C++ as follows:

// [[Rcpp::export]]
double haversine_cpp(double lat1, double long1,
                     double lat2, double long2,
                     std::string unit="km"){
  int radius = 6378;
  double delta_phi = to_radians_cpp(lat2 - lat1);
  double delta_lambda = to_radians_cpp(long2 - long1);
  double phi1 = to_radians_cpp(lat1);
  double phi2 = to_radians_cpp(lat2);
  double term1 = pow(sin(delta_phi / 2), 2);
  double term2 = cos(phi1) * cos(phi2);
  term2 = term2 * pow(sin(delta_lambda/2), 2);
  double the_terms = term1 + term2;
  double delta_sigma = 2 * atan2(sqrt(the_terms),
                                 sqrt(1-the_terms));
  double distance = radius * delta_sigma;
  /* if it is anything *but* km it is miles */
  if(unit != "km"){
    return(distance*0.621371);
  }
  return(distance);
}

Now, let's re-source it, and test it:

sourceCpp("our_cpp_functions.cpp")
haversine(51.88, 176.65, 56.94, 154.18)
haversine_cpp(51.88, 176.65, 56.94, 154.18)
----------------------------------------------
[1] 1552.079
[1] 1552.079

Are you surprised to see that the R and the C++ are so similar?

The only things that are unfamiliar in this new function are the following:

The int data type (which just holds an integer)
The std::string data type (which holds a string, or a character vector, in R parlance)
The if statement (which is identical to R's)

Other than those things, this is just building upon what we've already learned with the first function.
Our last matter of business is to rewrite the single.core function in C++. To build up to that, let's first write a C++ function called sum2 that takes a numeric vector and returns the sum of all the numbers:

// [[Rcpp::export]]
double sum2(Rcpp::NumericVector a_vector){
  double running_sum = 0;
  int length = a_vector.size();
  for( int i = 0; i < length; i++ ){
    running_sum = running_sum + a_vector(i);
  }
  return(running_sum);
}

There are a few new things in this function:

We have to specify the data type of all the variables (including function arguments) in C++, but what's the data type of the R vector that we're to pass into sum2? The #include statement at the top of the C++ file allows us to use the Rcpp::NumericVector data type (which does not exist in standard C++).
To get the length of a NumericVector (like we would in R with the length function), we use the .size() method.
The C++ for loop is a little different than its R counterpart. To wit, it takes three fields, separated by semicolons; the first field initializes a counter variable, the second field specifies the conditions under which the for loop will continue (we'll stop iterating when our counter index reaches the length of the vector), and the third is how we update the counter from iteration to iteration (i++ means add 1 to i). All in all, this for loop is equivalent to a for loop in R that starts with for(i in 1:length).
The way to subscript a vector in C++ is by using parentheses, not brackets. We will also be using parentheses when we start subscripting matrices.

At every iteration, we use the counter as an index into the NumericVector to extract the current element, we update the running sum with the current element, and when the loop ends, we return the running sum.

Please note before we go on that the first element of any vector in C++ is the 0th element, not the first. For example, the third element of a vector called victor is victor[3] in R, whereas it would be victor(2) in C++. This is why the second field of the for loop is i < length and not i <= length.

Now, we're finally ready to rewrite the single.core function from the last section in C++!
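As a quick sanity check on sum2, Rcpp also offers cppFunction, which compiles a single function from a string without a separate .cpp file (assuming, as above, that Rcpp and a C++ compiler are installed). This lets us compare sum2 against R's built-in sum right away:

```r
library(Rcpp)

# compile sum2 directly from a string, without a separate .cpp file
cppFunction('
  double sum2(Rcpp::NumericVector a_vector){
    double running_sum = 0;
    int length = a_vector.size();
    for(int i = 0; i < length; i++){
      running_sum = running_sum + a_vector(i);
    }
    return(running_sum);
  }
')

sum2(c(1.5, 2.5, 3))    # should agree with sum(c(1.5, 2.5, 3))
```

cppFunction is convenient for one-off experiments like this; for a growing collection of functions, the sourceCpp-on-a-file workflow we are using in this section scales better.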
// [[Rcpp::export]]
double single_core_cpp(Rcpp::NumericMatrix mat){
    int nrows = mat.nrow();
    int numcomps = nrows*(nrows-1)/2;
    double running_sum = 0;
    for( int i = 0; i < nrows; i++ ){
        for( int j = i+1; j < nrows; j++ ){
            double this_dist = haversine_cpp(mat(i,0), mat(i,1),
                                             mat(j,0), mat(j,1));
            running_sum = running_sum + this_dist;
        }
    }
    return running_sum / numcomps;
}

Nothing here should be too new. The only two new components are that we are taking a new data type, an Rcpp::NumericMatrix, as an argument, and that we are using .nrow() to get the number of rows in a matrix.

Let's try it out! When we used the R function single.core, we called it with the whole airport data.frame as an argument. But since the C++ function takes a matrix of latitude/longitude pairs, we will simply drop the first column (holding the airport name) from the airport.locs data frame, and convert what's left into a matrix.

sourceCpp("our_cpp_functions.cpp")
the.matrix <- as.matrix(airport.locs[,-1])
system.time(ave.dist <- single_core_cpp(the.matrix))
print(ave.dist)
---------------------------------------
   user  system elapsed
  0.012   0.000   0.012
[1] 1667.186

Okay, the task that used to take 5.5 seconds now takes less than one tenth of a second (and the outputs match, to boot!) Astoundingly, we can perform the task on all the 13,429 airports quite easily now:

the.matrix <- as.matrix(all.airport.locs[,-1])
system.time(ave.dist <- single_core_cpp(the.matrix))
print(ave.dist)
------------------------------
   user  system elapsed
 12.310   0.080  12.505
[1] 1869.744

Using Rcpp, it takes a mere 12.5 seconds to calculate and average 90,162,306 distances—a feat that would have taken even a 16 core server 17 minutes to complete.

Be smarter about your code

In a blog post that I penned showcasing the performance of this task under various optimization methods, I took it for granted that calculating the distances on the full dataset with the unparallelized/un-Rcpp-ed code would be a multi-hour affair—but I was seriously mistaken.

Shortly after publishing the post, a clever R programmer commented on it stating that they were able to slightly rework the code so that the serial/pure-R code took less than 20 seconds to complete with all the 13,429 observations. How? Vectorization.
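The double loop with j starting at i+1 is what visits each unordered pair of rows exactly once. To see just that structure in isolation, here is a sketch (my own, not the book's) that averages a trivial one-dimensional distance—the absolute difference—over all pairs of points; swap in haversine_cpp and a two-column matrix and you have single_core_cpp:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// average pairwise distance, using |x_i - x_j| as a stand-in for the
// haversine distance between rows; assumes at least two points
double average_pairwise(const std::vector<double>& points) {
    int n = points.size();
    int numcomps = n * (n - 1) / 2;       // number of unordered pairs
    double running_sum = 0;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) { // j > i: each pair counted once
            running_sum += std::fabs(points[i] - points[j]);
        }
    }
    return running_sum / numcomps;
}
```

For the points {0, 1, 3}, the three pairs give distances 1, 3, and 2, so the average is 2.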
single.core.improved <- function(airport.locs){
  numrows <- nrow(airport.locs)
  running.sum <- 0
  for(i in 1:(numrows-1)){
    this.dist <- sum(haversine(airport.locs[i, 2],
                               airport.locs[i, 3],
                               airport.locs[(i+1):numrows, 2],
                               airport.locs[(i+1):numrows, 3]))
    running.sum <- running.sum + this.dist
  }
  return(running.sum / (numrows*(numrows-1)/2))
}

system.time(ave.dist <- single.core.improved(all.airport.locs))
print(ave.dist)
-----------------------------------------------------------------
   user  system elapsed
 15.537   0.173  15.866
[1] 1869.744

Not even 16 seconds. It's worth following what this code is doing.

There is only one for loop that is making its rounds down the number of rows in the airport.locs data frame. On each iteration of the for loop, it calls the haversine function just once. The first two arguments are the latitude and longitude of the row that the loop is on. The third and fourth arguments, however, are the vectors of the latitudes and longitudes below the current row. This returns a vector of all the distances from the current airport to the airports below it in the dataset. Since the haversine function could just as easily take vectors instead of single numbers, there is no need for a second for loop.

So the haversine function was already vectorized; I just didn't realize it. You'd think that this would be embarrassing for someone who professes to know enough about R to write a book about it. Perhaps it should be. But I found out that one of the best ways to learn—especially about code optimization—is through experimentation and making mistakes.

For example, when I started learning about writing high performance R code for both fun and profit, I made quite a few mistakes. One of my first blunders/failed experiments was with this very task; when I first learned about Rcpp, I used it to translate the to.radians and haversine functions only. Having the loop remain in R proved to only give a slight performance edge—nothing compared to the 12.5 second business we've achieved together. Now I know that the bulk of the performance degradation was due to the millions of function calls to haversine—not the actual computation in the haversine function.

You could learn that and other lessons most effectively by continuing to try and messing up on your own.
The moral of the story: when you think you've vectorized your code enough, find someone smarter than you to tell you that you're wrong.

Exercises

Practice the following exercises to revise the concepts learned so far:

Is multiple imputation amenable to parallel computation? Why or why not?

How is the way we call to.radians wasteful? Is there any way to refactor our code to use to.radians in a more efficient way?

When I was gathering the data for Figure 12.2, I didn't check every sample size from 1 to the full dataset; yet, I obtained a smooth curve. What I did was test the performance of a handful of sample sizes from 100 to only 2,000. Then I used nls (non-linear least squares) to fit an equation of the form (where n is the sample size) to the data points, and extrapolated with this equation after solving for x. What are some benefits and drawbacks of this approach? Do this on your own machine, if applicable. Do your performance curves match mine?

There is a thought among some scholars that there is an incongruence between Adam Smith's two seminal works, The Wealth of Nations and The Theory of Moral Sentiments, namely that the preoccupation with self-interest of the former is at odds with the stress placed on the role of what Smith referred to as sympathy (caring for the well-being of others) in guiding moral judgments in the latter. Why are these scholars wrong?

Summary

We began this chapter by explaining some of the reasons why large datasets sometimes present a problem for unoptimized R code, such as no auto-parallelization and no native support for out-of-memory data. For the rest of the chapter, we discussed specific routes to optimizing R code in order to tackle large data.

First, you learned of the dangers of optimizing code too early. Next, we saw—much to the relief of slackers everywhere—that taking the lazy way out (and buying or renting a more powerful machine) is often the more cost-effective solution.

After that, we saw that a little knowledge about the dynamics of memory allocation and vectorization in R can often go a long way in performance gains.

The next two sections focused less on changing our R code and more on changing how we use our code. Specifically, we discovered that there are often performance gains to be had by just changing the packages we use and/or our implementation of the R language.
In another section, you learned how parallelization works and what "embarrassingly parallel" problems are. Then we restructured the code solving a real-world problem to employ parallelization. You learned how to do this for both Windows and non-Windows systems, and saw the performance gains you might expect to see when you parallelize embarrassingly parallel problems.

After that, we solved the same example from the last section using Rcpp and saw that:

- Connecting R and C++ doesn't have to be as scary as it sounds
- The performance often blows all other alternatives out of the water

We concluded with a parable that suggests that learning how to write performant R code is a journey and an art, rather than a topic that can be mastered all at once.

Chapter 13. Reproducibility and Best Practices

At the close of some programming texts, the reader, now knowing the intricacies of the subject of the text, is nevertheless bewildered about how to actually get started with some serious programming. Very often, discussion of the tooling, environment, and the like—the things that inveterate programmers of language x take for granted—is left for the reader to figure out on their own.

Take R, for example—when you click on the R icon on your system, a rather Spartan window with a text-based interface appears, imploring you to enter commands interactively. Are you to program R in this manner? By typing commands one-at-a-time into this window? This was, more or less, permissible up until this point in the book, but it just won't cut it when you're out there on your own. For any kind of serious work—requiring the rerunning of analyses with modifications, and so on—you need knowledge of the tools and typical workflows that professional R programmers use.

To not leave you in this unenviable position of not knowing how to get started, dear reader, we will be going through a whole chapter's worth of information on typical workflows and common/best practices.

You may have also noticed (via the enormous text at the top of this page) that the subject discussed in the previous paragraphs is sharing the spotlight with reproducibility. What's this, then?
Reproducibility is the ability for you, or an independent party, to repeat a study, experiment, or line of inquiry. This implies the possession of all the relevant and necessary materials and information. It is one of the principal tenets of scientific inquiry. If a study is not replicable, it is simply not science.

If you are a scientist, you are likely already aware of the virtues of reproducibility (if not, shame on you!). If you're a non-scientist data analyst, there is great merit in your taking reproducibility seriously, too. For one, starting an analysis with reproducibility in mind requires a level of organization that makes your job a whole lot easier, in the medium and long run. Secondly, the person who is likely going to be reproducing your analyses the most is you; do yourself a favor, and take reproducibility seriously so that when you need to make changes to an analysis, alter your priors, update your data source, adjust your plots and figures, or roll back to an established checkpoint, you make things easier on yourself. Lastly—and true to the intended spirit of reproducibility—it makes for more reliable and trustworthy dissemination of information.

By the way, all these benefits still hold even if you are working for a private (or otherwise confidential) enterprise, where the analyses are not to be repeated or known about outside of the institution. The ability of your coworkers to follow the narrative of your analysis is invaluable, and can give your firm a competitive edge. Additionally, the ability for supervisors to track and audit your progress is helpful—if you're honest. Finally, keeping your analyses reproducible will make your coworkers' lives much easier when you finally drop everything to go live on the high seas.

Anyway, we are talking about best practices and reproducibility in the same chapter because of the intimate relationship between the two goals. More explicitly, it is best practice for your code to be as reproducible as possible.

Both reproducibility and best practices are wide and diverse topics, but the information in this chapter should give you a great starting point.
R Scripting

The absolute first thing you should know about standard R workflows is that programs are not generally written directly at the interactive R interpreter. Instead, R programs are usually written in a text file (with a .r or .R file extension). These are usually referred to as R scripts. When these scripts are completed, the commands in this text file are usually executed all at once (we'll get to see how, soon). During development of the script, however, the programmer usually executes portions of the script interactively to get feedback and confirm proper behavior. This interactive component to R scripting allows for building each command or function iteratively.

I've known some serious R programmers who copy and paste from their favorite text editor into an interactive R session to achieve this effect. To most people, particularly beginners, the better solution is to use an editor that can send R code from the script that is actively being written to an interactive R console, line-by-line (or block-by-block). This provides a convenient mechanism to run code, get feedback, and tweak code (if need be) without having to constantly switch windows.

If you're a user of the venerable Vim editor, you may find that the Vim-R-plugin achieves this nicely. If you use the equally revered Emacs editor, you may find that Emacs Speaks Statistics (ESS) accomplishes this goal. If you don't have any compelling reason not to, though, I strongly suggest you use RStudio to fill this need. RStudio is a powerful, free Integrated Development Environment (IDE) for R. Not only does RStudio give you the ability to send blocks of code to be evaluated by the R interpreter as you write your scripts, but it also provides all the affordances you'd expect from the most advanced of IDEs, such as syntax highlighting, an interactive debugger, code completion, integrated help and documentation, and project management. It also provides some very helpful R-specific functionality, like a mechanism for visualizing a data frame in memory as a spreadsheet and an integrated plot window. Lastly, it is very widely used within the R community, so there is an enormous amount of help and support available.

Given that RStudio is so helpful, some of the remainder of the chapter will assume you are using it.
RStudio

First things first—go to http://www.rstudio.com, and navigate to the downloads page. Download and install the Open Source Edition of the RStudio Desktop application.

When you first open RStudio, you may only see three panes (as opposed to the four-paned window in Figure 13.1). If this is the case, click the button labeled e in Figure 13.1, and click R Script from the dropdown. Now the RStudio window should look a lot like the one from Figure 13.1.

The first thing you should know about the interface is that all of the panels serve more than one function. The panel labeled a is the source code editor. This will be the pane wherein you edit your R scripts. This will also serve as the editor panel for LaTeX, C++, or R Markdown, if you are writing these kinds of files. You can work on multiple files at the same time, using tabs to switch from document to document. Panel a will also serve as a data viewer that will allow you to view datasets loaded in memory in a spreadsheet-like manner.

Panel b is the interactive R console, which is functionally equivalent to the interactive R console that shipped with R from CRAN. This pane will also display other helpful information or the output of various goings-on in secondary or tertiary tabs.

Panel c allows you to see the objects that you have defined in your global environment. For example, if you load a dataset from disk or the web, the name of the dataset will appear in this panel; if you click on it, RStudio will open the dataset in the data viewer in panel a. This panel also has a tab labeled History, which you can use to view the R statements you've executed in the past.

Panel d is the most versatile one; depending on which of its tabs are open, it can be a file explorer, a plot-displayer, an R package manager, and a help browser.
Figure 13.1: RStudio's four-panel interface in Mac OS X (version 0.99.486)

The typical R script development workflow is as follows: R statements, expressions, and functions are typed into the editor in panel a; statements from the editor are executed in the console in panel b by putting the cursor on a chosen line and clicking the Run button (component g from the figure), or by selecting multiple lines and then clicking the Run button. If the outputs of any of these statements are plots, panel d will automatically display these. The script is named and saved when the script is complete (or, preferably, many times while you are writing it).

To learn your way around the RStudio interface, write an R script called nothing.R with the following content:

library(ggplot2)

nothing <- data.frame(a=rbinom(1000, 20, .5),
                      b=c("red", "white"),
                      c=rnorm(1000, mean=100, sd=10))

qplot(c, data=nothing, geom="histogram")

write.csv(nothing, "nothing.csv")

Execute the statements one by one. Notice that the histogram is automatically displayed in panel d. After you are done, type and execute ?rbinom in the interactive console. Notice how panel d displays the help page for this function? Finally, click on the object labeled nothing in panel c and inspect the dataset in the data viewer.

Running R scripts

There are a few ways to run saved R scripts, like nothing.R. First—and this is RStudio specific—is to click the button labeled Source (component h). This is roughly equivalent to highlighting the entire document and clicking Run.
Of course, we would like to run R scripts without being dependent on RStudio. One way to do this is to use the source function in the interactive R console—either RStudio's console, the console that ships with R from CRAN, or your operating system's command prompt running R. The source function takes a filename as its first and only required argument. The file specified will be executed, and when it's done, it will return you to the prompt with all the objects from the R script now in your workspace. Try this with nothing.R; executing the ls() command after the source function ends should indicate that the nothing data frame is now in your workspace. Calling the source() function is what happens under the hood when you press the Source button in RStudio. If you have trouble making this work, make sure that either (a) you specify the full path to the file nothing.R in the source() function call, or (b) you use setwd() to make the directory containing nothing.R your current working directory, before you execute source("nothing.R").

A third, less popular method is to use the R CMD BATCH command on your operating system's command/terminal prompt. This should work on all systems, out of the box, except Windows, which may require you to add the R binary folder (usually, something like: C:\Program Files\R\R-3.2.1\bin) to your PATH variable. There are instructions on how to accomplish this on the web.

Note

Your system's command prompt (or terminal emulator) will depend on which operating system you use. Windows users' command prompt is called cmd.exe (which you can run by pressing Windows-key + R, typing cmd, and striking enter). Macintosh users' terminal emulator is known as Terminal.app, and is under /Applications/Utilities. If you use GNU/Linux or BSD, you know where the terminal is.

Use the following incantation:

R CMD BATCH nothing.R

This will execute the code in the file, and automatically direct its output into a file named nothing.Rout, which can be read with any text editor.
R may have asked you, any time you tried to quit R, whether you wanted to save your workspace image. Saving your workspace image means that R will create a special file in your current working directory (usually named .RData) containing all the objects in your current workspace, which will be automatically loaded again if you start R in that directory. This is super useful if you are working with R interactively and you want to exit R, but be able to pick up right where you left off some other time. However, this can cause issues with reproducibility, since another user won't have the same .RData file on their computer (and you won't have it when you rerun the same script on another computer). For this reason, we use R CMD BATCH with the --vanilla option:

R --vanilla CMD BATCH nothing.R

which means: don't restore previously saved objects from .RData; don't save the workspace image when the R script is done running; and don't read any of the files that can store custom R code that will automatically load in each R session, by default. Basically, this amounts to: don't do anything that wouldn't be able to be replicated using another computer and R installation.

The final method—which is my preference—is to use the Rscript program that comes with recent versions of R. On GNU/Linux, Macintosh, or any other Unix-like system that supports R, this will automatically be available to use from the command/terminal prompt. On Windows, the aforementioned R binary folder must be added to your PATH variable.

Using Rscript is as easy as typing the following:

Rscript nothing.R

Or, if you care about reproducibility (and you do!):

Rscript --vanilla nothing.R

This is the way I suggest you run R scripts when you're not using RStudio.

Note

If you are using a Unix or Unix-like operating system (like Mac OS X or GNU/Linux), you may want to put a line like #!/usr/bin/Rscript --vanilla as the first line in your R scripts. This is called a shebang line, and will allow you to run your R scripts as a program without specifying Rscript at the prompt. For more information, read the article Shebang (Unix) on Wikipedia.
An example script

Here's an example R script that we will be referring to for the rest of the chapter:

#!/usr/bin/Rscript --vanilla

###########################################################
##                                                       ##
##                 nyc-sat-scores.R                      ##
##                                                       ##
##           Author: Tony Fischetti                      ##
##                   [email protected]                   ##
##                                                       ##
###########################################################

##
## Aim: to use Bayesian analysis to compare NYC's 2010
##      combined SAT scores against the average of the
##      rest of the country, which, according to
##      FairTest.com, is 1509
##

# workspace cleanup
rm(list=ls())

# options
options(echo=TRUE)
options(stringsAsFactors=FALSE)

# libraries
library(assertr)    # for data checking
library(runjags)    # for MCMC

# make sure everything is all set with JAGS
testjags()
# yep!

## read data file
# data was retrieved from the NYC Open Data portal
# direct link: https://data.cityofnewyork.us/api/views/zt9s-n5aj/rows.csv?accessType=DOWNLOAD
nyc.sats <- read.csv("./data/SAT_Scores_NYC_2010.csv")

# let's give the columns easier names
better.names <- c("id", "school.name", "n", "read.mean",
                  "math.mean", "write.mean")
names(nyc.sats) <- better.names

# there are 460 rows but almost 700 NYC schools
# we will *assume*, then, that this is a random
# sample of NYC schools

# let's first check the veracity of this data…
#nyc.sats <- assert(nyc.sats, is.numeric,
#                   n, read.mean, math.mean, write.mean)

# It looks like the check failed because there are "s"s for some
# rows. (??) A look at the dataset descriptions indicates
# that the "s" is for schools with 5 or fewer students.
# For our purposes, let's just exclude them.
# This is a function that takes a vector, replaces all "s"s
# with NAs and converts all non-"s"s into numerics
remove.s <- function(vec){
  ifelse(vec=="s", NA, vec)
}

nyc.sats$n          <- as.numeric(remove.s(nyc.sats$n))
nyc.sats$read.mean  <- as.numeric(remove.s(nyc.sats$read.mean))
nyc.sats$math.mean  <- as.numeric(remove.s(nyc.sats$math.mean))
nyc.sats$write.mean <- as.numeric(remove.s(nyc.sats$write.mean))

# Remove schools with fewer than 5 test takers
nyc.sats <- nyc.sats[complete.cases(nyc.sats), ]

# Calculate a total combined SAT score
nyc.sats$combined.mean <- (nyc.sats$read.mean +
                           nyc.sats$math.mean +
                           nyc.sats$write.mean)

# Let's build a posterior distribution of the true mean
# of NYC high schools' combined SAT scores.
# We're not going to look at the summary statistics, because
# we don't want to bias our priors

# Specify a standard gaussian model
the.model <- "
model {
  # priors
  mu ~ dunif(0, 2400)
  stddev ~ dunif(0, 500)
  tau <- pow(stddev, -2)

  # likelihood
  for(i in 1:theLength){
    samp[i] ~ dnorm(mu, tau)
  }
}"

the.data <- list(
  samp = nyc.sats$combined.mean,
  theLength = length(nyc.sats$combined.mean)
)

results <- autorun.jags(the.model, data=the.data,
                        n.chains=3,
                        monitor=c('mu', 'stddev'))

# View the results of the MCMC
print(results)

# Plot the MCMC diagnostics
plot(results, plot.type=c("histogram", "trace"), layout=c(2,1))
# Looks good!

# Let's extract the MCMC samples of the mean and get the
# bounds of the middle 95%
results.matrix <- as.matrix(results$mcmc)
mu.samples <- results.matrix[,'mu']
bounds <- quantile(mu.samples, c(.025, .975))

# We are 95% sure that the true mean is between 1197 and 1232

# Now let's plot the marginal posterior distribution for the mean
# of the NYC high schools' combined SAT grades and draw the 95%
# percent credible interval.
plot(density(mu.samples),
     main=paste("Posterior distribution of mean combined SAT",
                "score in NYC high schools (2010)", sep="\n"))
lines(c(bounds[1], bounds[2]), c(0,0), lwd=3, col="red")

# Given the results, the SAT scores for NYC high schools in 2010
# are *incontrovertibly* not at par with the average SAT scores of
# the nation.

There are a few things I'd like you to note about this R script, and its adherence to best practices.
First, the file name is nyc-sat-scores.R—not foo.R, doit.R, or any of that nonsense; when you are looking through your files in six months, there will be no question about what the file was supposed to do.

The second is that comments are sprinkled liberally throughout the entire script. These comments serve to state the intentions and purpose of the analysis, separate sections of code, and remind ourselves (or anyone who is reading) where the data file came from. Additionally, comments are used to block out sections of code that we'd like to keep in the script, but which we don't want to execute. In this example, we commented out the statement that calls assert, since the assertion fails. With these comments, anybody—even an R beginner—can follow along with the code.

There are a few other manifestations of good practice on display in this script: indentation that aids in following the code flow, spaces and new-lines that enhance readability, lines that are restricted to under 80 characters, and variables with informative names (no foo, bar, or baz).

Lastly, take note of the remove.s function we employ instead of copy-and-pasting ifelse(vec=="s", NA, …) four times. An angel loses its wings every time you copy-and-paste code, since it is a notorious vector for mistakes.

Scripting and reproducibility

Put any code that is not one-off, and is meant to be run again, in a script. Even for one-off code, you are better off putting it in a script, because (a) you may be wrong (and often are) about not needing to run it again, (b) it provides a record of what you've done (including, perhaps, unnoticed bugs), and (c) you may want to use similar code at another time.

Scripting enhances reproducibility, because now the only things we need to reproduce this line of inquiry on another computer are the script and the data file. If we didn't place all this code in a script, we would have had to copy and paste our interactive R console history, which is ugly and messy to say the absolute least.
It’stimetocomecleanaboutafibItoldintheprecedingparagraph.Inmostcases,allyou needtoreproducetheresultsarethedatafile(s)andtheRscript(s).Insomecases, however,somecodeyou’vewrittenthatworksinyourversionofRmaynotworkon anotherperson’sversionofR.Somewhatmorecommonisthatthecodeyouwrite,which usesafunctionalityprovidedbyapackage,maynotworkonanotherversionofthat package. Forthisreason,it’sgoodpracticetorecordtheversionofRandthepackagesyou’re using.YoucandothisbyexecutingsessionInfo(),andcopyingtheoutputandpastingit intoyourRscriptatthebottom.Makesuretocommentalloftheselinesout,orRwill attempttoexecutethemthenexttimethescriptisrun.Foraprettier/betteralternativeto sessionInfo(),usethesession_info()functionfromthedevtoolspackage.Theoutput ofdevtools::session_info()forourexamplescriptlookslikethis: >devtools::session_info() Sessioninfo--------------------------------settingvalue versionRversion3.2.1(2015-06-18) systemx86_64,darwin13.4.0 uiRStudio(0.99.486) language(EN) collateen_US.UTF-8 tzAmerica/New_York date1969-07-20 Packages------------------------------------package*versiondatesource assertr*1.0.02015-06-26CRAN(R3.2.1) coda0.17-12015-03-03CRAN(R3.2.0) devtools1.9.12015-09-11CRAN(R3.2.0) digest0.6.82014-12-31CRAN(R3.2.0) lattice0.20-332015-07-14CRAN(R3.2.0) memoise0.2.12014-04-22CRAN(R3.2.0) modeest2.12012-10-15CRAN(R3.2.0) rjags3-152015-04-15CRAN(R3.2.0) runjags*2.0.2-82015-09-14CRAN(R3.2.0) Thepackagesthatweexplicitlyloadedaremarkedwithanasterisk;alltheotherpackages listedarepackagesthatareusedbythepackagesweloaded.Itisimportanttonotethe versionofthesepackages,too,astheycanpotentiallycausecross-version irreproducibility. Rprojects Therearesome(rare)caseswhereasingleRscriptcontainsthetotalityofyour research/analyses.Thismayhappenifyouaredoingsimulationstudies,forexample.For mostcases,ananalysiswillconsistofascript(orscripts)andatleastonedataset.Irefer toanyRanalysisthatusesatleasttwofilesasanRproject. 
In R projects, special attention must be paid to how the files are stored relative to each other. For example, if we stored the file SAT_Scores_NYC_2010.csv on our desktop, the data import line would have read:

read.csv("/Users/bensisko/Desktop/SAT_Scores_NYC_2010.csv")

If we wanted to send this analysis to a contributor to be replicated, we would send them the script and the data file. Even if we instructed them to place the file on their desktop, the script would still not be reproducible. Our collaborators on Windows and Unix would have to manually change the argument of read.csv to C:/Users/jameskirk/Desktop/SAT_Scores_NYC_2010.csv or /home/katjaneway/Desktop/SAT_Scores_NYC_2010.csv, respectively.

A far better way to handle this situation is to organize all your files in a neat hierarchy that will allow you to specify relative paths for your data imports. In this case, it means making a folder called sat-scores (or something like that), which contains the script nyc-sat-scores.R and a folder called data that contains the file SAT_Scores_NYC_2010.csv:

Figure 13.2: A sample file/folder hierarchy for an R analysis project

The function call read.csv("./data/SAT_Scores_NYC_2010.csv") instructs R to load the dataset inside the data folder in the current working directory. Now, if we wanted to send our analysis to a collaborator, we would just send them the folder (which we can compress, if we want), and it will work no matter what our collaborator's username and operating system is. Additionally, everything is nice and neat, and in one place. Note that we put a file called README.txt into the root directory of our project. This file would contain information about the analysis, instructions for running it, and so on. This is a common convention.

Anyway, never use absolute paths!

In projects that use more than one R script, some choose a slightly different project layout.
Forexample,let’ssaywedividedourprecedingscriptintoload-and-clean-sat-data.R andanalyze-sat-data.R;wemightchooseafolderhierarchythatlookslikethis: Figure13.3:Asamplefile/folderhierarchyforamultiscriptRanalysisproject Underthisorganizationalparadigm,thetwoscriptsarenowplacedinafoldercalledcode, andanewscriptmaster.Risplacedintheproject’srootdirectory.master.Riscalled driverscript,anditwillcallourtwonon-driverscriptsintherightorder.Forexample, master.Rmaylooklikethis: #!/usr/bin/Rscript--vanilla source("./code/load-and-clean-sat-data.R") source("./code/analyze-sat-data.R") Now,ourcollaboratorjusthastoexecutemaster.R,whichwill,inturn,executeour analysisscripts. Note ThereareafewalternativestousinganRscriptasadriver.Onecommonalternativeisto useashellscriptasadriver.Thesescriptscontaincodethatisrunbytheoperating system’scommand-lineinterpreter.Adownsideofthisapproachisthatshellscriptsare,in general,notportableacrosstheWindowsversusall-other-operating-systemsdivide. Acommon,butsomewhatmoreadvancedalternative,istoreplacemaster.Rwitha dependency-trackingbuildutilitylikemake,shake,sake,ordrake.Thisoffersahostof benefitsincludingextensibilityandidentificationofredundantcomputations. Versioncontrol Averycompellingbenefittoourneathierarchicalorganizationschemeisthatitlends itselftoeasyintegrationwithversioncontrolsystems.Versioncontrolsystems,atabasic level,allowonetotrackchanges/revisionstoasetoffiles,andeasilyrollbacktoprevious statesofthesetoffiles. Asimple(andinadequate)approachistocompressyouranalysisprojectatregular intervals,andpost-fixthefilenameofeachcompressedcopywithatimestamp.Thisway, ifyoumakeamistake,andwouldliketoreverttoapreviousversion,allyouhavetodois deleteyourcurrentprojectandun-compresstheprojectfromthetimeyouwanttoroll backto. 
A far more sane solution is to use a remote file synchronization service that features revision tracking. The most popular of these services at the time of writing is Dropbox, though there are others such as TeamDrive and Box. These services allow you to upload your project into the cloud. When you make changes to your local copy, these services will track your changes, resynchronize the remotely stored copy, and version your project for you. Now you can revert to a previous version of just one file, instead of having to revert the entire project hierarchy.

Note

Beware! Some of these services have a limit on the number of revisions they track. Make sure you look into this for the service that you choose to use.

A great benefit of using one of these services is that any number of collaborators can be invited to work on the project simultaneously. You can even set permissions for the files each collaborator can read/write to. The service you choose should be able to track the changes made by the collaborators, too.

Perhaps the sanest solution is to use an actual version control system like Git, Mercurial, Subversion, or CVS. These are traditionally used for software projects that contain hundreds of files and many, many contributors, but they're proving to be a crackerjack solution for data analysts with just a few files and little to no other contributors. These alternatives offer the most flexibility in terms of rollback, revision tracking, conflict (incompatible changes) resolution, compression, and merging. The combination of Git and GitHub (a remote Git repository hosting service) is proving to be a particularly effective and common solution for statistical programmers.

Version control enhances reproducibility—since all the changes to the entire project (scripts/data/folder-structure layouts) are documented, all the changes are repeatable.

If your data files are small to medium, keeping them in your project will play nicely with your version control solution; it will even offer great benefits like the assurance that no one tampered with your data. If your data is too large, though, you might look into other data storage solutions like remote database storage.
Note

Package version management

Some R analysts, who rely heavily on the use of add-on CRAN packages, may choose to use a tool to manage these packages and their versions. The two most popular tools to do this are the packages packrat and checkpoint.

packrat, which is the more popular of the two, maintains a library of the packages an analysis uses inside the project's root directory. This allows the analysis and the packages it depends on to be version controlled.

checkpoint allows you to use the versions of CRAN packages as they were on a particular date. An analyst would store the date of the CRAN snapshot used at the top of a script, and the proper versions of these packages would automatically download on a collaborator's machine.

Communicating results

Unless an analysis is performed solely for the personal edification of the analyst, the results are going to be communicated—either to teammates, your company, your lab, or the general public. Some very advanced technologies are in place for R programmers to communicate their results accurately and attractively.

Following the pattern of some of the other sections in this chapter, we will talk about a range of approaches, starting with a bad alternative and giving an explanation for why it's inadequate.

The terrible solution to creating a statistical report is to copy R output into a Word document (or PowerPoint presentation) mixed with prose. Why is this terrible, you ask? Because if one little thing about your analysis changes, you will have to re-copy the new R output into the document, manually. If you do this enough times, it's not a matter of if but a matter of when you will mess up and copy the wrong thing, or forget to copy the new output, and so on. This method just opens up too many vectors for mistakes. Additionally, any time you have to make a slight change to a plot, update a data source, alter priors, or even change the number of multiple imputation iterations to use, it requires a herculean effort on your part to keep the document up to date.
All better solutions involve having R directly output the document that you will use to communicate your results. RStudio (along with the knitr and rmarkdown packages) makes it very easy for you to have your analysis spit out a paper rendered with LaTeX, a slideshow presentation, or a self-contained HTML web page. It's even possible to have R directly output a Word document, whose contents are dynamically created using R objects.

The least attractive, but easiest, of the alternatives is to use the Compile Notebook function from the RStudio interface (the button labeled f in Figure 13.1). A pop-up should appear asking you if you want the output in HTML, PDF, or a Word document. Choose one and look at the output.

Figure 13.4: An excerpt from the output of Compile Notebook on our example script

Sure, this may not be the prettiest document in the world, but at least it combines our code (including our informative comments) and results (plots) in a single document. Further, any change to our R script followed by recompiling the notebook will result in a completely updated document for sharing. It's a little bit weird to have our narrative told completely via comments, though, right?

Literate programming is a novel programming paradigm put forth by genius computer scientist Donald Knuth (whom we mentioned in the previous chapter). This approach involves interspersing computer code and prose in the same document. Whereas the Compile Notebook feature doesn't allow for prose (except in code comments), the RStudio/knitr/rmarkdown stack allows for an approach to report generation where the prose/narrative plays a more integral part. To begin, click the New Document button (component e), and choose R Markdown… from the dropdown. Choose a title like example1 in the pop-up window, leave the default output format, and press OK. You should see a document with some unfamiliar symbols in the editor. Finally, click the button labeled Knit HTML (it's the button with the cute image of a ball of yarn), and inspect the output.

Go back to the editor and re-read the code that produced the HTML output. This is R Markdown: a lightweight markup language with easy-to-remember formatting syntax elements and support for embedded R code.

Besides the auto-generated header, the document consists of an alternating series of two components.
The first of the components is stretches of prose written in Markdown. With Markdown, a range of formatting options can be written in plain text that can be rendered in many different output formats, like HTML and PDF. These formatting options are simple: *This* produces italic text; **this** produces bold text. For a handy cheat sheet of Markdown formatting options, click the question mark icon (which appears when you are editing R Markdown [.Rmd] documents), and choose Markdown Quick Reference from the dropdown.

The second component is snippets of R code called chunks. These chunks are put between two sets of backticks (```). The set of three backticks that opens a chunk looks like ```{r}. Between the curly braces, you can optionally name the chunk, and you can specify any number of chunk options. Note that in example1.Rmd, the second chunk uses the option echo=FALSE; this means that the code snippet plot(cars) will not appear in the final rendered document, even though its output (namely, the plot) will.

There's an element of R Markdown that I want to call out explicitly: inline R code. During stretches of prose, any text between `r and ` is evaluated by the R interpreter and substituted with its result in the final rendered document. Without this mechanism, any specific numbers/information related to the data objects (like the number of observations in a dataset) would have to be hardcoded into the prose, and whenever the code changed, the onus of visiting each of these hardcoded values to make sure they were up to date would be on the report author. Using inline R to offload this updating onto R eliminates an entire class of common mistakes in report generation.

What follows is a re-working of our SAT script in R Markdown. This will give us a chance to look at this technology in more detail, and gain an appreciation for how it can help us achieve our goals of easy-to-manage, reproducible, literate research.
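To see both components and inline R code together before the full example, here is a minimal, hypothetical .Rmd document (the title and chunk name are made up for illustration; cars is a dataset built into R):

````markdown
---
title: "A Tiny Example"
output: html_document
---

The `cars` dataset has `r nrow(cars)` observations.  <!-- inline R code -->

*This* chunk is named, and its code is hidden with `echo=FALSE`,
but its plot still appears in the rendered document:

```{r speed-plot, echo=FALSE}
plot(cars)
```
````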
---
title: "NYC SAT Scores Analysis"
author: "Tony Fischetti"
date: "November 1, 2015"
output: html_document
---

#### Aim:
To use Bayesian analysis to compare NYC's 2010
combined SAT scores against the average of the
rest of the country, which, according to
FairTest.com, is 1509

```{r, echo=FALSE}
# options
options(echo=TRUE)
options(stringsAsFactors=FALSE)
```

We are going to use the `assertr` and `runjags`
packages for data checking and MCMC, respectively.

```{r}
# libraries
library(assertr)    # for data checking
library(runjags)    # for MCMC
```

Let's make sure everything is all set with JAGS!

```{r}
testjags()
```

Great!

This data was found in the NYC Open Data Portal:
https://nycopendata.socrata.com

```{r}
link.to.data <- "http://data.cityofnewyork.us/api/views/zt9s-n5aj/rows.csv?accessType=DOWNLOAD"
download.file(link.to.data, "./data/SAT_Scores_NYC_2010.csv")
nyc.sats <- read.csv("./data/SAT_Scores_NYC_2010.csv")
```

Let's give the columns easier names

```{r}
better.names <- c("id", "school.name", "n", "read.mean",
                  "math.mean", "write.mean")
names(nyc.sats) <- better.names
```

There are `r nrow(nyc.sats)` rows but almost 700 NYC schools. We will,
therefore, *assume* that this is a random sample of NYC schools.

Let's first check the veracity of this data…

```{r, error=TRUE}
nyc.sats <- assert(nyc.sats, is.numeric,
                   n, read.mean, math.mean, write.mean)
```

It looks like the check failed because there are "s"s for some rows. (??)
A look at the dataset descriptions indicates that the "s" is for schools
with 5 or fewer students. For our purposes, let's just exclude them.
This is a function that takes a vector, replaces all "s"s
with NAs, and converts all non-"s"s into numerics

```{r}
remove.s <- function(vec){
  ifelse(vec=="s", NA, vec)
}

nyc.sats$n          <- as.numeric(remove.s(nyc.sats$n))
nyc.sats$read.mean  <- as.numeric(remove.s(nyc.sats$read.mean))
nyc.sats$math.mean  <- as.numeric(remove.s(nyc.sats$math.mean))
nyc.sats$write.mean <- as.numeric(remove.s(nyc.sats$write.mean))
```

Now we are going to remove schools with fewer than 5 test takers
and calculate a combined SAT score

```{r}
nyc.sats <- nyc.sats[complete.cases(nyc.sats), ]

# Calculate a total combined SAT score
nyc.sats$combined.mean <- (nyc.sats$read.mean +
                           nyc.sats$math.mean +
                           nyc.sats$write.mean)
```

Let's now build a posterior distribution of the true mean of NYC high
schools' combined SAT scores. We're not going to look at the summary
statistics, because we don't want to bias our priors.

We will use a standard gaussian model.

```{r, cache=TRUE, results="hide", warning=FALSE, message=FALSE}
the.model <- "
model {
  # priors
  mu ~ dunif(0, 2400)
  stddev ~ dunif(0, 500)
  tau <- pow(stddev, -2)

  # likelihood
  for(i in 1:theLength){
    samp[i] ~ dnorm(mu, tau)
  }
}"

the.data <- list(
  samp = nyc.sats$combined.mean,
  theLength = length(nyc.sats$combined.mean)
)

results <- autorun.jags(the.model, data=the.data,
                        n.chains=3,
                        monitor=c('mu'))
```

Let's view the results of the MCMC.

```{r}
print(results)
```

Now let's plot the MCMC diagnostics

```{r, message=FALSE}
plot(results, plot.type=c("histogram", "trace"), layout=c(2,1))
```

Looks good!

Let's extract the MCMC samples of the mean, and get the
bounds of the middle 95%

```{r}
results.matrix <- as.matrix(results$mcmc)
mu.samples     <- results.matrix[,'mu']
bounds <- quantile(mu.samples, c(.025, .975))
```

We are 95% sure that the true mean is between
`r round(bounds[1], 2)` and `r round(bounds[2], 2)`.

Now let's plot the marginal posterior distribution for the mean
of the NYC high schools' combined SAT grades, and draw the 95%
credible interval.
```{r}
plot(density(mu.samples),
     main=paste("Posterior distribution of mean combined SAT",
                "score in NYC high schools (2010)", sep="\n"))
lines(c(bounds[1], bounds[2]), c(0,0), lwd=3, col="red")
```

Given the results, the SAT scores for NYC high schools in 2010
are **incontrovertibly** not on par with the average SAT scores of
the nation.

------------------------------------

This is some session information for reproducibility:

```{r}
devtools::session_info()
```

This R Markdown document, when rendered by knitting the HTML, looks like this:

Figure 13.5: An excerpt from the output of Knit HTML on our example R Markdown document

Now, that's a handsome document!

A few things to note: First, our contextual narrative is no longer told through code comments; the narrative, code, code output, and plots are all separate and easily distinguished. Second, note that both the number of observations in the dataset and the bounds of our credible interval are dynamically woven into the final document. If we change our priors, or use a different likelihood function (and we should; see exercise #3), the bounds as they appear in our final report will be automatically updated.

Finally, take a look at the chunk options we've used. We hid the code in our first chunk so that we didn't clutter the final document with option setting. In the sixth chunk, we used the option error=TRUE to let the renderer know that we expected the contained code to fail. The printed error message nicely illustrates why we had to spend the subsequent chunk on data cleaning. In the ninth chunk (the one where we run the MCMC chains), we use quite a few options. cache=TRUE caches the result of the chunk so that if the chunk's code doesn't change, we don't have to wait for the MCMC chains to converge every time we render the document. We use results="hide" to hide the verbose output of autorun.jags. We use warning=FALSE to suppress the warning emitted by autorun.jags informing us that we didn't choose starting values for the chains. Lastly, we use message=FALSE to quiet the message produced by autorun.jags that the rjags namespace is automatically being loaded. autorun.jags sure is chatty!
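The knitting need not happen through RStudio's buttons at all; as a sketch, the same report can be produced from the R console (the file name example1.Rmd is illustrative; render() is the real entry point of the rmarkdown package, and it requires pandoc to be installed):

```r
# Render an R Markdown file to a self-contained HTML report
# without the RStudio interface (pandoc must be on the PATH)
library(rmarkdown)
render("example1.Rmd", output_format = "html_document")
```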
We may opt to use different chunk options depending on our intended audience. For example, we could hide more of the code, and focus more on the output and interpretation, if we were communicating the results to a party of non-statistical-programmers. On the other hand, we would hide less of the code if we were using the rendered HTML as a pedagogical document to teach budding R programmers how to use R Markdown.

The HTML that is produced can now be uploaded, as a standalone document, to a web server so that the results can be sent to others as a hyperlink. Bear in mind, too, that we are not limited to knitting HTML; we could have just as easily knitted a PDF or Word document. We could have also used R Markdown to produce a slideshow presentation; I use this technology all the time at work.

You don't necessarily have to use RStudio to produce these handsome, dynamically generated reports (they can be rendered using only the knitr and rmarkdown packages and a format-conversion utility called pandoc), but RStudio makes writing them so easy, you would need a really compelling reason to use any other editor.

knitr is a beefy package indeed, and we only touched on the tip of the iceberg in regard to what it is capable of; we didn't cover, for example, customizing the reports with HTML, embedding math equations into the reports, or using LaTeX (instead of R Markdown) for increased flexibility. If you see the power in knitr, and in dynamically generated literate documents in general, I urge you to learn more about it.

Exercises

Practice the following exercises to revise the concepts of reproducibility learned in this chapter:

Review: When we created the data frame nothing, we combined a vector of 1,000 binomially distributed random variables, 1,000 normally distributed random variables, and a vector of two colors, red and white. Since all the columns in a data frame have to be the same length, how did R allow this? What is the property of vectors that allows this?

Seek out, read, and attempt to understand the source code of some of your favorite R packages. What version control system is the author of each package using?
Carefully review the analysis that was used as an example in this chapter. In what manner can this analysis be improved upon? Look at the distribution of the combined SAT scores in NYC schools. Why was modeling the SAT scores with a Gaussian likelihood function a (very) bad choice? What could we have done instead?

If both a poor and a rich person are willing to buy a pair of sneakers for no more than $40, who values the sneakers the most, and who should get the sneakers in order for that resource to be allocated most efficiently? Couch your answer in terms of the diminishing marginal utility of money. What would the law of diminishing marginal utility say about the most equitable income tax schema, with respect to different income levels?

Summary

This last chapter, which was uncharacteristically light on theory, may be one of the most important chapters in the whole book. In order to be a productive data analyst using R, you simply must be acquainted with the tools and workflows of professional R programmers.

The first topic we touched on was the link between best practices and reproducibility, and why reproducibility is an integral part of a productive and sane analyst's workflow. Next, we discussed the basics of R scripting, and how to run completed scripts all at once. We saw how RStudio, R's best IDE, can help us while we write these scripts by providing a mechanism to execute code, line by line, as we write it. To really cement your understanding of R scripting, we saw an example R script that illustrated clean design and adherence to best practices (informative variable names, readable layout, myriad informative comments, and so on).

Then, you learned of a few ways that you can organize multi-file analysis projects. You saw how the correct organizational structure of an analysis project naturally lends itself to integration with version control, a powerful tool in the organized analyst's utility belt. You learned how the benefits conferred by a sophisticated version control system (the ability to revert to previous versions, track all revisions, and merge incompatible revisions) could potentially save an analyst from hours of heartache.
Finally, you saw how to use the RStudio/knitr/rmarkdown stack to achieve the goal of producing reproducible reports of your analyses. You learned the dangers of ad hoc, copy-and-paste manual report generation, and discovered that a better solution is to charge R with creating the report itself. The simplest solution, compiling a notebook, was at least better than the manual alternatives, but produced reports that were somewhat lacking in the flexibility and aesthetics departments. You saw that, instead, we can use R Markdown to create fancy-pants, attractive, dynamically generated reports that cut down on errors, complement reproducibility, and aid in the effective dissemination of information.