Top 15 Commands for Data Manipulation in R

R is one of the most widely used programming languages for data and statistical analysis. At eMumba we use R heavily to make sense of data, find patterns, and perform general exploratory data analysis. The results of these analyses are typically fed into machine learning models to solve classification and regression problems. In this post, I want to share the top 15 commands that come in handy during data exploration in R.

  1. Reading/writing a CSV file

    You can read data from a CSV file and load it into a table-like structure called a data frame.

    # header = FALSE treats the first line of the file as data, not as column names
    csv_file = read.csv("/file/path.csv", header = FALSE)
    

    Similarly, you can write the contents of a data frame to a file.

    write.csv(data_frame, file = "/file/path.csv", row.names = FALSE)

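    To try both commands without touching real files, here is a minimal round-trip sketch. The data frame people and the temporary path are made up for illustration. It reads with header = TRUE because write.csv puts the column names on the first line.

    ```r
    # Hypothetical sample data, used only for illustration
    people <- data.frame(name = c("Ann", "Bob"), age = c(31, 45))

    # Write to a temporary file so the example does not touch real paths
    csv_path <- tempfile(fileext = ".csv")
    write.csv(people, file = csv_path, row.names = FALSE)

    # header = TRUE because the first line of the file holds the column names
    people_again <- read.csv(csv_path, header = TRUE)
    print(people_again)
    ```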
    
  2. Print the first 5 and last 5 rows of a data frame

    head(data_frame, n = 5) # prints the first 5 rows of a data frame (the default n is 6)
    tail(data_frame, n = 5) # prints the last 5 rows of a data frame (the default n is 6)

  3. Print number of rows in a data frame

    nrow(data_frame)

  4. Combining two data frames into one

    This combines the data frames first and second into combined by stacking their rows. Both data frames must have the same columns.

    combined = rbind(first, second)

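    As a small illustration, assuming two made-up data frames that share the same column names:

    ```r
    # Two hypothetical data frames with identical column names
    first <- data.frame(id = 1:2, score = c(90, 85))
    second <- data.frame(id = 3:4, score = c(70, 95))

    # rbind() appends the rows of `second` below the rows of `first`
    combined <- rbind(first, second)
    print(combined)
    ```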
    
  5. Filter a data frame

    This filters a data frame by applying the which command to a condition on a particular column. The resulting data frame, filtered, contains all the records where column_name has a value greater than 1.

    filtered = data_frame[which(data_frame$column_name > 1), ]

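    One reason to wrap the condition in which is missing values. The sketch below uses a made-up data frame scores to compare the two forms: which drops rows where the condition is NA, while plain logical indexing keeps them as all-NA rows.

    ```r
    # Hypothetical data with one missing value
    scores <- data.frame(id = 1:4, value = c(0.5, 2, NA, 3))

    # which() returns only the positions where the condition is TRUE
    with_which <- scores[which(scores$value > 1), ]

    # Plain logical indexing turns the NA comparison into an all-NA row
    without_which <- scores[scores$value > 1, ]

    print(with_which)
    print(without_which)
    ```
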
  6. Print column summary

    The result of the summary command depends on the type of object passed. For a numeric column, the summary reports the minimum, first quartile, median, mean, third quartile, and maximum.

    summary(data_frame$column_name)

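    A quick sketch of how the output changes with the type of the input, using made-up values:

    ```r
    # Hypothetical numeric values; the 100 is a deliberate outlier
    values <- c(1, 2, 3, 4, 100)
    numeric_summary <- summary(values)
    print(numeric_summary)

    # For a factor, summary() returns the count of each level instead
    groups <- factor(c("a", "b", "a"))
    factor_summary <- summary(groups)
    print(factor_summary)
    ```
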
  7. Create a new column using existing column values

    data$ratio = as.numeric(data$weight/data$height)

  8. Replace all specific values in a column with another value

    data$some_column[data$some_column == 'old_value'] <- 'new_value'
    
  9. Format a column as date time

    data$start_time = as.POSIXct(data$start_time, format="%d/%m/%Y %H:%M:%S")
    # with a proper date-time column, you can filter by time
    filtered = data[which(data$start_time >= as.POSIXct("2017-02-18 16:00:00")), ]

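    Here is a self-contained sketch of the same idea with made-up timestamps. The tz = "UTC" argument is an assumption added so the result does not depend on the local time zone.

    ```r
    # Hypothetical timestamps stored as text in day/month/year order
    data <- data.frame(
      start_time = c("18/02/2017 15:30:00", "18/02/2017 16:45:00", "19/02/2017 09:00:00"),
      stringsAsFactors = FALSE
    )

    # Parse the text into date-time values
    data$start_time <- as.POSIXct(data$start_time, format = "%d/%m/%Y %H:%M:%S", tz = "UTC")

    # Keep only the rows at or after the cutoff
    cutoff <- as.POSIXct("2017-02-18 16:00:00", tz = "UTC")
    filtered <- data[which(data$start_time >= cutoff), , drop = FALSE]
    print(filtered)
    ```
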
  10. Find percentage contribution of each distinct value in a column

    as.data.frame(prop.table(table(data_frame$column_name)) * 100)

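    To see what this returns, here is a small sketch on made-up data. The resulting columns are named Var1 and Freq by default, so the example renames them; the names value and percent are my own choice.

    ```r
    # Hypothetical column with two distinct values
    data_frame <- data.frame(column_name = c("a", "b", "a", "a"))

    # table() counts each value, prop.table() converts counts to fractions
    shares <- as.data.frame(prop.table(table(data_frame$column_name)) * 100)

    # Replace the default names Var1 and Freq with descriptive ones
    names(shares) <- c("value", "percent")
    print(shares)
    ```
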
  11. Find data types of each column

    sapply(data_frame, class)
    
  12. Find all distinct values from a column

    unique(data_frame$column_name)
    
  13. Rename a column

    colnames(data_frame)[column_index] <- "new_name"

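    If you know the current name but not the position, a variant of the same command looks the column up by name. A minimal sketch with a made-up data frame:

    ```r
    # Hypothetical data frame with columns a and b
    data_frame <- data.frame(a = 1:2, b = 3:4)

    # Select the column by its current name instead of its index
    colnames(data_frame)[colnames(data_frame) == "b"] <- "new_name"
    print(colnames(data_frame))
    ```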
    
  14. Stratified sampling

    A common task in solving machine learning problems is to divide the data set into subsets while maintaining the original proportions of the labels in each subset.
    This can be done in R using the following commands:

    # you need the caTools package
    install.packages("caTools")
    library(caTools)
    # sample.split returns a logical vector: TRUE marks rows for the training set
    sampled_data = sample.split(data_frame$column_for_sampling, SplitRatio=0.7)
    training_set = data_frame[sampled_data, ]
    test_set = data_frame[!sampled_data, ]

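    If you cannot install caTools, the same idea can be sketched in base R. This is not what sample.split does internally, just a simple alternative: sample 70% of the row indices within each label. It uses the built-in iris data set, and the seed is an arbitrary choice for reproducibility.

    ```r
    set.seed(42)  # arbitrary seed so the split is reproducible

    # iris ships with R: 150 rows, 50 per Species
    labels <- iris$Species

    # For each label, sample 70% of that label's row indices
    train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
      sample(idx, size = round(0.7 * length(idx)))
    }))

    training_set <- iris[train_idx, ]
    test_set <- iris[-train_idx, ]

    # Each species keeps the same share in both subsets
    print(table(training_set$Species))
    print(table(test_set$Species))
    ```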
    
  15. Scatter plot

    R has rich built-in support for data visualization. A scatter plot can be drawn with a single command, as shown below.

    plot(data_frame$column1, data_frame$column2, main="title", xlab="x-label", ylab="y-label", pch=19, col=factor(data_frame$color_column))

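    For a version you can run as is, here is a sketch using the built-in iris data set. Writing to a temporary PDF is my own choice so that the example also works in a session without a display.

    ```r
    # Draw into a temporary PDF so the example also runs without a display
    plot_path <- tempfile(fileext = ".pdf")
    pdf(plot_path)

    # Points are colored by species; pch = 19 draws solid circles
    plot(iris$Sepal.Length, iris$Petal.Length,
         main = "Iris measurements", xlab = "Sepal length", ylab = "Petal length",
         pch = 19, col = factor(iris$Species))

    # Close the device so the file is written to disk
    invisible(dev.off())
    ```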

Not enough? You can find even more handy commands, tips, and tricks on this GitHub page.
