
Data Analysis Using R

This file contains basic instructions for data analysis using R.



R is a programming language and free software environment for statistical computing and graphics. It is supported by the R Core Team and the R Foundation for Statistical Computing. It is widely used among statisticians and data miners for developing statistical software and data analysis.


Table of contents

Basics of R

Common Commands for Data Management

  1. Adding a new variable (column) to a dataset: mutate(data, new_var = )

     library(tidyverse) # packages should be loaded before use
        
     my_data2 = mutate(my_data1, ratio = childid/momeduc) # creates a new dataset, my_data2, with a variable called ratio equal to childid/momeduc, while preserving all other variables from my_data1
    
     my_data1 = mutate(my_data1, ratio1 = childid * momeduc * 5) # adds a new variable called ratio1 to the same dataset
    

    Make sure you load the required package before you use commands from it.

  2. Selecting specific variables and creating a new dataset: select()

    library(tidyverse)
    
    select(my_data1, c("childid", "socio5", "momeduc")) # selects only childid, socio5, and momeduc
    
    my_data2 = select(my_data1, c("childid", "socio5", "momeduc")) # creates a new dataset, my_data2, with only these three variables
    
  3. Selecting specific rows based on a given condition: `filter(data, var_condition)`

    filter(my_data1, sex == 2) # returns all variables in my_data1 for the rows in which sex equals 2
    
    filter(my_data1, childid >= 50) # returns all variables in my_data1 for the rows in which childid is greater than or equal to 50
    
  4. Using the pipe (%>%) to chain filter and other functions

    my_data1 %>%
    filter(childid >= 50) # does the same as the previous filter() call
    
    my_data1 %>%
    filter(childid >= 50) %>%
    select(c("childid", "socio5", "momeduc")) # first keeps the rows of my_data1 for which childid is greater than or equal to 50, then selects three variables: childid, socio5, and momeduc. With the pipe we can apply multiple functions in a single operation.
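
The four commands above can be tried without the original data. The following is a minimal sketch, assuming a small made-up data frame called `toy` (hypothetical, not part of `my_data1`) whose column names mirror the ones used above. It loads `dplyr` directly, which is one of the packages attached by `library(tidyverse)`.

```r
library(dplyr)  # attached automatically by library(tidyverse)

# Hypothetical toy data; values are invented for illustration only
toy <- data.frame(
  childid = c(10, 50, 75, 120),
  socio5  = c(1, 3, 5, 2),
  momeduc = c(2, 4, 5, 10),
  sex     = c(1, 2, 2, 1)
)

# mutate(): add a column while keeping all existing ones
toy2 <- mutate(toy, ratio = childid / momeduc)

# select(): keep only the named columns
toy3 <- select(toy2, childid, socio5, momeduc)

# filter(): keep only the rows that meet a condition
sex2 <- filter(toy, sex == 2)

# The same kind of steps chained with the pipe
result <- toy %>%
  filter(childid >= 50) %>%
  mutate(ratio = childid / momeduc) %>%
  select(childid, ratio)

print(result)
```

Each function takes a data frame as its first argument and returns a new data frame, which is why the steps can be chained with `%>%` without naming the intermediate results.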
    

Data Visualization and Graphics

Using built in R functions

  1. Scatter plot plot(x,y)

    plot(my_data1$bweight, my_data1$msdp)
    

  2. Histogram hist(data$var)

    hist(my_data1$momeduc)
    


  3. Bar plot barplot(data$var, xlab = " ", ylab = " ", col = " ")

    barplot(my_data1$momage[1:10], xlab = "Observation", ylab = "momage", main = "Barplot of momage", col = "orange") # draws one bar per observation for the first 10 values of momage in my_data1, so the bar heights are the momage values themselves. main sets the title. This is a big dataset and a bar for every observation looks messy, which is why I chose the first 10 observations.
    


  4. Boxplot `boxplot(data$var, xlab = "", ylab = "", main = "", col = "")`

    boxplot(my_data1$momage, xlab = "momage", ylab = "Age", main = "Boxplot of momage", col = "red") # plots the boxplot of momage from my_data1
    

    boxplot(my_data1$momage ~ my_data1$socio5, xlab = "socio5", ylab = "Age", main = "Boxplot of momage by socio5", col = rainbow(5)) # plots one boxplot of momage for each level of socio5 in my_data1. col = rainbow(n) gives n different colors, where n is the number of boxplots
    

  5. Pie chart `pie(data$var, xlab = "", radius = "", main = "", col = "", clockwise)`

    pie(my_data1$socio5[1:5], labels = c("A", "B", "C", "D", "E"), main = "Piechart of socio5", col = rainbow(5)) # creates a pie chart from the first five observations of socio5
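
The barplot and pie calls above draw one bar or slice per observation. When the goal is to show how many observations fall into each category, one common approach is to count the categories with `table()` first and plot the counts. The sketch below assumes a made-up vector called `socio` (hypothetical, standing in for a five-level variable such as socio5) and uses only base R.

```r
# Hypothetical categorical variable with five levels; values are invented
socio <- c(1, 2, 2, 3, 3, 3, 4, 5, 5, 1, 3, 2)

# table() counts how many observations fall in each category
counts <- table(socio)
print(counts)

# Bar heights are the category counts, so "Frequency" describes the y-axis
barplot(counts, xlab = "socio", ylab = "Frequency",
        main = "Barplot of socio", col = "orange")

# One slice per category, sized by its count
pie(counts, labels = names(counts), main = "Piechart of socio",
    col = rainbow(length(counts)))
```

Tabulating first keeps the plot readable for a large dataset, because the number of bars or slices depends on the number of categories rather than the number of rows.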
    

Using ggplot2

Statistical Tools