R is one of the most widely used programming languages for data and statistical analysis. At eMumba we use R heavily to make sense of data, to find patterns, and for general exploratory data analysis. The results of these analyses are usually fed into machine learning models to solve various classification and regression problems. In this post, I want to share the top 15 commands that come in very handy while doing data exploration in R.
Reading/Writing CSV file
You can read data from a CSV file and load it into a table-like structure called a data frame.
csv_file = read.csv("/file/path.csv", header = FALSE) # use header = TRUE if the first row holds column names
Similarly, you can write the contents of a data frame into a file.
write.csv(data_frame, file = "/file/path.csv", row.names = FALSE)
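As a quick check of the two commands above, here is a minimal round trip through a temporary file. The data frame `people` and its columns are made up for illustration, and `header = TRUE` is used on the way back in because `write.csv` writes a header row.

```r
# Hypothetical sample data, not from the original post
people <- data.frame(name = c("a", "b", "c"), age = c(31, 25, 47))

# Write to a temporary file so nothing on disk is overwritten
path <- tempfile(fileext = ".csv")
write.csv(people, file = path, row.names = FALSE)

# Read it back; the file starts with a header row, so header = TRUE
restored <- read.csv(path, header = TRUE)
print(restored)
```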
Print first 5 and last 5 rows of a data frame
head(data_frame, 5) # prints the first 5 rows of a data frame (the default is 6)
tail(data_frame, 5) # prints the last 5 rows of a data frame (the default is 6)
Print number of rows in a data frame
nrow(data_frame)
Combining two data frames into one
Combines the data frames first and second row by row into combined. Both data frames must have the same columns.
combined = rbind(first, second)
Filter a data frame
Filters a data frame by applying the which command to a particular column. As a result, the filtered data frame will contain all the records where column_name has a value greater than 1.
filtered = data_frame[which(data_frame$column_name > 1), ]
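One reason to wrap the condition in `which` is how missing values are handled. The sketch below uses a made-up data frame with one `NA` in the filter column to compare plain logical indexing with the `which` form.

```r
# Hypothetical data with a missing value in the filter column
data_frame <- data.frame(id = 1:4, column_name = c(0.5, 2, NA, 3))

# Plain logical indexing: the NA comparison produces an extra all-NA row
with_na <- data_frame[data_frame$column_name > 1, ]

# which() returns only the positions that are TRUE, so the NA row is dropped
filtered <- data_frame[which(data_frame$column_name > 1), ]

print(with_na)
print(filtered)
```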
Print column summary
The result of the summary command depends on the type of object passed. In the case of a numeric column, the summary will tell you the minimum, first quartile, median, mean, third quartile, and maximum of that column.
summary(data_frame$column_name)
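To illustrate how the output depends on the column type, here is a small sketch with a numeric column and a factor column. The data frame and its column names `score` and `group` are invented for this example.

```r
# Hypothetical data: one numeric column and one factor column
data_frame <- data.frame(
  score = c(1, 2, 3, 4, 100),
  group = factor(c("a", "b", "a", "a", "b"))
)

# Numeric column: minimum, quartiles, median, mean and maximum
numeric_summary <- summary(data_frame$score)

# Factor column: a count per level
factor_summary <- summary(data_frame$group)

print(numeric_summary)
print(factor_summary)
```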
Create a new column using existing column values
data$ratio = as.numeric(data$weight / data$height)
Replace all specific values in a column with another value
data$some_column[data$some_column == 'old_value'] <- 'new_value'
Format a column as date time
data$start_time = as.POSIXct(data$start_time, format = "%d/%m/%Y %H:%M:%S")
# this enables you to do filtering as shown below
filtered = data[which(data$start_time >= "2017-02-18 16:00:00"), ]
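The sketch below runs the same conversion and filter on a few made-up timestamps. It pins the time zone to UTC and builds the cutoff with `as.POSIXct` explicitly so the comparison does not depend on the machine's local time zone; both choices, and the `id` column, are assumptions added for the example.

```r
# Hypothetical data with day/month/year timestamps stored as text
data <- data.frame(
  id = 1:3,
  start_time = c("18/02/2017 15:30:00", "18/02/2017 16:00:00", "19/02/2017 09:15:00"),
  stringsAsFactors = FALSE
)

# Parse the text into date-time values, fixing the time zone to UTC
data$start_time <- as.POSIXct(data$start_time, format = "%d/%m/%Y %H:%M:%S", tz = "UTC")

# Build the cutoff in the same time zone and keep rows at or after it
cutoff <- as.POSIXct("2017-02-18 16:00:00", tz = "UTC")
filtered <- data[which(data$start_time >= cutoff), ]
print(filtered)
```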
Find percentage contribution of each distinct value in a column
as.data.frame(prop.table(table(data_frame$column_name)) * 100)
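A small worked example may help here: `table` counts each distinct value, `prop.table` turns the counts into fractions, and multiplying by 100 gives percentages. The four-row data frame is made up for illustration.

```r
# Hypothetical column with three "x" values and one "y" value
data_frame <- data.frame(column_name = c("x", "y", "x", "x"))

# Count each value, convert counts to fractions, then scale to percentages
shares <- as.data.frame(prop.table(table(data_frame$column_name)) * 100)

# The result has one row per distinct value, in columns Var1 and Freq
print(shares)
```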
Find data types of each column
sapply(data_frame, class)
Find all distinct values from a column
unique(data_frame$column_name)
Rename a column
colnames(data_frame)[column_index] <- "new_name"
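If you know the current name of the column rather than its position, a variation on the same command is to look the column up by name. This is a sketch with made-up column names, not something from the original post.

```r
# Hypothetical data frame with a column we want to rename
data_frame <- data.frame(old_name = 1:3, other = c("a", "b", "c"))

# Select the position whose name matches, then assign the new name
colnames(data_frame)[colnames(data_frame) == "old_name"] <- "new_name"

print(colnames(data_frame))
```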
A common task in solving machine learning problems is to divide the data set into subsets while maintaining the original proportions of the labels in each subset.
This can be done in R using the following commands:
# you need the caTools package
install.packages("caTools")
library(caTools)
# sample.split returns a logical vector: TRUE marks rows for the training set
sampled_data = sample.split(data_frame$column_for_sampling, SplitRatio = 0.7)
training_set = data_frame[sampled_data, ]
test_set = data_frame[!sampled_data, ]
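To see what "maintaining the original proportions" means in practice, here is a base R sketch of the same idea that does not use caTools: it samples about 70% of the rows within each label separately. The data, the column names, and the use of `round` to size each sample are assumptions made for this illustration, and it is not how `sample.split` is implemented.

```r
set.seed(42)  # make the random split reproducible

# Hypothetical data: 30 rows, 20 labelled "a" and 10 labelled "b"
data_frame <- data.frame(
  value = 1:30,
  label = c(rep("a", 20), rep("b", 10))
)

# Group row indices by label, then pick about 70% of each group for training
rows_by_label <- split(seq_len(nrow(data_frame)), data_frame$label)
train_rows <- unlist(lapply(rows_by_label, function(idx) {
  idx[sample.int(length(idx), round(0.7 * length(idx)))]
}))

training_set <- data_frame[train_rows, ]
test_set <- data_frame[-train_rows, ]

# Both subsets keep the 2:1 ratio of "a" to "b" found in the full data
print(table(training_set$label))
print(table(test_set$label))
```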
R has a very rich set of tools for data visualization. A scatter plot can be drawn using a single command, as shown below:
plot(data_frame$column1, data_frame$column2, main = "title", xlab = "x-label", ylab = "y-label", pch = 19, col = factor(data_frame$color_column))
Not enough? You can find even more handy commands and various tips and tricks on this GitHub page.