R LANGUAGE OVERVIEW
I. Introduction to R
We use "data.csv" (which you can download here), introduced in the Data Manipulation section.
1. About R
R is a free software environment for statistical computing and graphics. It has been popular in academia for more than 20 years because of its strong data analysis features. You can read more about the history of R, and comments on it, on its official website.
2. Install R User Interface - RStudio
- Software: Before you can download and use RStudio, you need to download and install R. To do so, pick a mirror and download the installer from it; for instance, you can use the UCLA mirror. On a Mac, make sure that you download a .pkg file and install it by clicking on it.
- Download: Then visit this webpage and download RStudio for free.
- Installation process: See this post for details, or refer to this video for Mac users or this video for Windows users.
- Open the UI: After the download and installation, click the RStudio button to open the user interface.
3. Import and Export Data using R
In R, you can import and export data directly. If you have a data file called "data.csv", put it in the same folder as your R or RMD file. You can then import it with read.csv() and export a data frame with write.csv().
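The import and export steps can be sketched as follows. This is a minimal sketch, not the original code of this page: the column names (id, score), the temporary folder, and the output file name "data_export.csv" are assumptions made so that the example runs on its own. In practice you would pass "data.csv" directly, since it sits next to your R or RMD file.

```r
# Setup (assumption): create a small hypothetical "data.csv" in a temporary
# folder so that this sketch runs without the real file.
folder  <- tempdir()
in_path <- file.path(folder, "data.csv")
writeLines(c("id,score", "1,90.5", "2,72", "3,88.25"), in_path)

# Import: read the CSV file into a data frame.
data <- read.csv(in_path)

# Export: write the data frame to a new CSV file.
# row.names = FALSE keeps R's row numbers out of the file.
out_path <- file.path(folder, "data_export.csv")
write.csv(data, out_path, row.names = FALSE)

str(data)
```

With the real file in your working directory, the two key lines reduce to read.csv("data.csv") and write.csv(data, "data_export.csv", row.names = FALSE).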
II. Data Cleaning and Data Manipulation in R
In this section, we use "person_info.csv" (which you can download here), introduced in the Data Manipulation section.
1. Check and Clean Missing Values / NA
You can use the is.na() and which() functions to locate missing values:
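A minimal sketch of this approach is shown here. The data frame is hypothetical (the actual columns of "person_info.csv" are not reproduced), so the names name and age are assumptions for illustration only.

```r
# Hypothetical data standing in for person_info.csv.
person <- data.frame(
  name = c("Ann", "Bob", NA, "Dee"),
  age  = c(34, NA, 29, NA)
)

# is.na() marks each missing entry with TRUE.
age_missing <- is.na(person$age)

# which() turns the logical vector into row positions.
missing_rows <- which(age_missing)

# Count the missing values in every column.
na_per_column <- colSums(is.na(person))

# arr.ind = TRUE reports the row and column index of every NA.
na_positions <- which(is.na(person), arr.ind = TRUE)

print(missing_rows)
print(na_per_column)
print(na_positions)
```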
Click READ MORE for details about finding NAs and how to impute them.
2. Check and Convert Data Types
- You can use the str(), summary(), class(), and typeof() functions to check data types.
- You can use the as.numeric(), as.logical(), as.integer(), as.double(), as.factor(), and as.character() functions to convert data types.
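The checking and converting functions listed above can be combined as in this sketch. The data frame and its column names are hypothetical; they only illustrate a common situation in which numbers and logical values were read in as text.

```r
# Hypothetical data: every column starts out as character.
person <- data.frame(
  name   = c("Ann", "Bob", "Cy"),
  age    = c("34", "41", "29"),
  member = c("TRUE", "FALSE", "TRUE"),
  city   = c("LA", "NY", "LA"),
  stringsAsFactors = FALSE
)

# Check the types.
str(person)           # compact overview of every column
class(person$age)     # "character"
typeof(person$age)    # "character"

# Convert each column to a more suitable type.
person$age    <- as.integer(person$age)
person$member <- as.logical(person$member)
person$city   <- as.factor(person$city)

# Check again after the conversion.
summary(person)
class(person$city)    # "factor"
typeof(person$city)   # "integer" (factors are stored as integer codes)
```

Note that class() reports how R treats an object, while typeof() reports how it is stored, which is why the two differ for a factor.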
Click READ MORE for details about checking and manipulating data types.
3. Check and Clean Outliers
A point is defined as an outlier if it is below Q1−1.5×IQR or above Q3+1.5×IQR, where Q1 is the 25% quantile of the data, Q3 is the 75% quantile of the data, and IQR (the interquartile range) is Q3−Q1.
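The rule above can be sketched in base R as follows. The income vector is hypothetical, with 95 planted as an obvious outlier; quantile() is used with its default settings, which is an assumption about how the quantiles should be computed.

```r
# Hypothetical values; 95 is planted as an outlier.
income <- c(42, 45, 47, 50, 51, 53, 55, 58, 95)

# Quartiles and interquartile range.
q1  <- quantile(income, 0.25, names = FALSE)
q3  <- quantile(income, 0.75, names = FALSE)
iqr <- q3 - q1    # the same value as IQR(income)

# Fences from the definition: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Flag the outliers, then separate them from the rest.
is_outlier <- income < lower | income > upper
outliers   <- income[is_outlier]
cleaned    <- income[!is_outlier]

print(c(lower = lower, upper = upper))
print(outliers)
```

For this vector, Q1 is 47 and Q3 is 55, so the fences are 35 and 67 and only 95 is flagged. Whether to drop, cap, or keep a flagged point depends on the analysis.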
4. Advanced Data Manipulation - datetime and string manipulations
(a) datetime Manipulation
In R, we recommend using the lubridate library for manipulating datetime objects.
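A few common lubridate operations are sketched here, assuming the package is already installed (for example with install.packages("lubridate")). The dates are made up for illustration and do not come from "person_info.csv".

```r
library(lubridate)

# Parse text in year-month-day order into a Date.
d <- ymd("2021-03-15")

# Extract components.
year(d)
month(d)
day(d)

# Date arithmetic: add a period of 30 days.
later <- d + days(30)

# Parse a full timestamp and read the hour.
ts <- ymd_hms("2021-03-15 08:30:00", tz = "UTC")
hour(ts)

# Number of days between two dates.
gap <- as.numeric(difftime(ymd("2021-12-31"), d, units = "days"))

print(later)
print(gap)
```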
(b) string Manipulation
In R, we recommend using the stringr library for string manipulation. You can use this library to normalize strings and to perform approximate matching.
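String normalization with stringr can be sketched as follows, assuming the package is installed. The names are hypothetical. The last step uses a loose regular expression as a simple stand-in for approximate matching; it is an illustration only, not a full fuzzy-matching method.

```r
library(stringr)

# Hypothetical, inconsistently formatted names.
raw <- c("  Alice SMITH ", "bob   jones", "ALICE smith")

# Normalize: str_squish() trims the ends and collapses repeated spaces,
# str_to_title() capitalizes the first letter of each word.
clean <- str_to_title(str_squish(raw))

# Loose match: accept "Alice" or "Alise", ignoring case.
is_alice <- str_detect(clean, regex("^ali[cs]e", ignore_case = TRUE))

print(clean)
print(is_alice)
```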
Click READ MORE for details about advanced data manipulation for datetime and string objects.