Chapter 6 Data Exploration
Data exploration is an essential part of data analytics and a preliminary step to the data wrangling techniques covered in the next three chapters. As in other areas of R, there are many different approaches to data exploration. This chapter covers data exploration techniques that do not rely on data visualization.
You can have data without information, but you cannot have information without data
Daniel Keys Moran
6.1 Selecting Data Columns
The size of data continues to grow, both in the number of rows and in the number of columns in a dataframe. Wide data, meaning data with many columns, can make exploration difficult, so it is important to be able to limit and specify the columns that are used during data exploration.
6.1.1 Select data by column name
Description | |
---|---|
Method to subset data using a specific column name |
Ingredients | |
---|---|
Package | Data |
readr | sample.csv |
Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, column_name)
Actual Instructions
dplyr::select(df, assessment_account_number)
6.1.2 Select data by multiple column names
Description | |
---|---|
Method to subset data using multiple column names |
Ingredients | |
---|---|
Package | Data |
readr | sample.csv |
Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, column_name1, column_name2, column_name3, column_name4)
Actual Instructions
dplyr::select(df, assessment_account_number, civic_city_name, sale_price, sale_date)
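As a minimal sketch of name-based selection, the snippet below uses a small made-up data frame in place of the sample file. The column names match the ones used above, but the values and the column layout are invented for illustration. It shows that the result keeps the columns in the order they are listed in the call, not the order they have in the source data.

```r
# Hypothetical stand-in for the sample file; all values are invented.
df <- data.frame(
  assessment_account_number = c(101, 102, 103),
  civic_city_name = c("DARTMOUTH", "HALIFAX", "COLE HARBOUR"),
  sale_price = c(45000, 75000, 300000),
  sale_date = as.Date(c("2019-01-15", "2019-03-02", "2019-06-20")),
  parcels_in_sale = c(1, 2, 1),
  stringsAsFactors = FALSE
)

# Columns come back in the order they are named in the call.
picked <- dplyr::select(df, sale_price, assessment_account_number)
names(picked)
```

Listing the columns in a different order is therefore a simple way to reorder them while subsetting.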
6.1.3 Select data by column index
Description | |
---|---|
Method to subset data using specific column index numbers |
Ingredients | |
---|---|
Package | Data |
readr | sample.csv |
Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, number)
Actual Instructions
dplyr::select(df, 2)
6.1.4 Select data by multiple column index
Description | |
---|---|
Method to subset data using multiple column index numbers |
Ingredients | |
---|---|
Package | Data |
readr | sample.csv |
Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, number, number, number, number)
Actual Instructions
dplyr::select(df, 2, 8, 9, 10)
6.1.5 Select data by column index range
Description | |
---|---|
Method to subset data in a range of column index numbers |
Ingredients | |
---|---|
Package | Data |
readr | sample.csv |
Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, number:number)
Actual Instructions
dplyr::select(df, 1:4)
6.1.6 Select data by column index and column index range
Description | |
---|---|
Method to subset data by specific column index numbers and a range of column index numbers |
Ingredients | |
---|---|
Package | Data |
readr | sample.csv |
Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, number, number, number, number:number)
Actual Instructions
dplyr::select(df, 2, 6, 11, 8:10)
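To make the index-based recipes concrete, here is a small sketch that mixes a single index with an index range. The five-column data frame is a made-up stand-in, so its column positions are an assumption and will not match the positions in the real sample file. The sketch checks that selecting by position gives the same result as naming the columns at those positions.

```r
# Hypothetical stand-in; the column positions here are invented.
df <- data.frame(
  assessment_account_number = c(101, 102),  # position 1
  civic_city_name = c("DARTMOUTH", "HALIFAX"),  # position 2
  sale_price = c(45000, 75000),  # position 3
  sale_date = as.Date(c("2019-01-15", "2019-03-02")),  # position 4
  parcels_in_sale = c(1, 2),  # position 5
  stringsAsFactors = FALSE
)

# One single index followed by a range of indexes.
by_index <- dplyr::select(df, 5, 2:3)

# The same selection written with column names.
by_name <- dplyr::select(df, parcels_in_sale, civic_city_name, sale_price)

identical(by_index, by_name)
```

Index-based selection is compact, but it depends on the column order of the file, so a name-based call may be easier to maintain if the file layout can change.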
6.1.7 Select all but last column
Description | |
---|---|
Method to retrieve all columns except the last column |
Ingredients | |
---|---|
Package | Data |
readr | sample.csv |
Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, -package::function())
Actual Instructions
dplyr::select(df, -dplyr::last_col())
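A negative index in `dplyr::select()` drops the column at that position counting from the left, so `-1` removes the first column rather than the last. The sketch below, which uses a made-up three-column data frame, contrasts that with the `dplyr::last_col()` helper, which targets the last column regardless of how many columns there are.

```r
# Hypothetical stand-in data; values are invented.
df <- data.frame(
  assessment_account_number = c(101, 102),
  civic_city_name = c("DARTMOUTH", "HALIFAX"),
  sale_price = c(45000, 75000),
  stringsAsFactors = FALSE
)

# -1 drops the FIRST column.
drop_first <- dplyr::select(df, -1)
names(drop_first)

# -last_col() drops the LAST column.
drop_last <- dplyr::select(df, -dplyr::last_col())
names(drop_last)
```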
6.2 Filtering Data
As with reducing the number of columns, it is important to be able to reduce the number of rows during data exploration. This is done by creating a subset of the data based on a single value or multiple values.
6.2.1 Filter single column by character value
Description | |
---|---|
Method to filter a single column of character data when equal to a specific value |
Ingredients | |
---|---|
Package | Data |
readr | sample.csv |
Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, column_name == "text")
Actual Instructions
dplyr::filter(df, civic_city_name == "DARTMOUTH")
6.2.2 Filter single column by numeric value
Description | |
---|---|
Method to filter a single column of numeric data when equal to a specific value |
Ingredients | |
---|---|
Package | Data |
readr | sample.csv |
Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, column_name == number)
Actual Instructions
dplyr::filter(df, parcels_in_sale == 1)
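Two details of equality filters are easy to miss: the comparison is case-sensitive for character data, and `dplyr::filter()` drops rows where the condition evaluates to `NA`. The sketch below illustrates both with a made-up data frame that deliberately includes missing values; the data is an assumption, not taken from the sample file.

```r
# Hypothetical stand-in data with missing values; values are invented.
df <- data.frame(
  civic_city_name = c("DARTMOUTH", NA, "HALIFAX", "DARTMOUTH"),
  parcels_in_sale = c(1, 1, 2, NA),
  stringsAsFactors = FALSE
)

# Keeps the two DARTMOUTH rows; the row with a missing city is dropped.
dartmouth <- dplyr::filter(df, civic_city_name == "DARTMOUTH")

# Case matters: no row has the value "Dartmouth".
wrong_case <- dplyr::filter(df, civic_city_name == "Dartmouth")

# Keeps the two rows equal to 1; the row with a missing count is dropped.
single_parcel <- dplyr::filter(df, parcels_in_sale == 1)
```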
6.2.3 Filter single column by multiple character values
Description | |
---|---|
Method to filter a single column of character data with multiple character values |
Ingredients | |
---|---|
Package | Data |
readr | sample.csv |
Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, column_name %in% c("text", "text", "text"))
Actual Instructions
dplyr::filter(df, civic_city_name %in% c("DARTMOUTH", "COLE HARBOUR", "PORTERS LAKE"))
6.2.4 Filter single column by multiple numeric values
Description | |
---|---|
Method to filter a single column of numeric data with multiple numeric values |
Ingredients | |
---|---|
Package | Data |
readr | sample.csv |
Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, column_name %in% c(number, number, number))
Actual Instructions
dplyr::filter(df, parcels_in_sale %in% c(2, 3, 5))
6.2.5 Filter single column by range of numeric values
Description | |
---|---|
Method to filter numeric data within a range of values |
Ingredients | |
---|---|
Package | Data |
readr | sample.csv |
Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, package::function(column, low_value, high_value))
Actual Instructions
dplyr::filter(df, dplyr::between(sale_price, 50000, 100000))
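`dplyr::between()` is inclusive at both ends, which matters when prices sit exactly on a boundary. The following sketch uses made-up prices placed just below, on, and just above the limits, and compares the result with the equivalent pair of `>=` and `<=` conditions.

```r
# Hypothetical prices chosen to sit on and around the boundaries.
df <- data.frame(sale_price = c(49999, 50000, 75000, 100000, 100001))

# between() keeps values equal to either limit.
in_range <- dplyr::filter(df, dplyr::between(sale_price, 50000, 100000))

# Equivalent filter written with explicit comparisons.
explicit <- dplyr::filter(df, sale_price >= 50000, sale_price <= 100000)

in_range$sale_price
```

If one of the limits should be excluded, the explicit form with `>` or `<` is the simpler choice.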
6.2.6 Filter numeric data larger than
Description | |
---|---|
Method to filter numeric data that is larger than a value |
Ingredients | |
---|---|
Package | Data |
readr | sample.csv |
Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, column > number)
Actual Instructions
dplyr::filter(df, sale_price > 250000)
6.2.7 Filter numeric data smaller than
Description | |
---|---|
Method to filter numeric data that is smaller than a value |
Ingredients | |
---|---|
Package | Data |
readr | sample.csv |
Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, column < number)
Actual Instructions
dplyr::filter(df, sale_price < 50000)
6.2.8 Filter multiple columns
Description | |
---|---|
Method to filter data on multiple columns, keeping only the rows where every condition is true |
Ingredients | |
---|---|
Package | Data |
readr | sample.csv |
Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, column1 == "text", column2 == number)
Actual Instructions
dplyr::filter(df, civic_city_name == "DARTMOUTH", parcels_in_sale == 2)
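Conditions separated by commas in `dplyr::filter()` are combined with a logical AND, so a row must satisfy all of them to be kept. As a small sketch on made-up data, the comma form below returns the same rows as a single condition joined with `&`.

```r
# Hypothetical stand-in data; values are invented.
df <- data.frame(
  civic_city_name = c("DARTMOUTH", "DARTMOUTH", "HALIFAX"),
  parcels_in_sale = c(2, 1, 2),
  stringsAsFactors = FALSE
)

# Comma-separated conditions: both must be TRUE.
with_comma <- dplyr::filter(df, civic_city_name == "DARTMOUTH", parcels_in_sale == 2)

# The same filter written as one expression with &.
with_and <- dplyr::filter(df, civic_city_name == "DARTMOUTH" & parcels_in_sale == 2)

identical(with_comma, with_and)
```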
6.2.9 Filter when multiple character values not in a column
Description | |
---|---|
Method to filter data when specific text values are not present in a specified column |
Ingredients | |
---|---|
Package | Data |
readr | sample.csv |
Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, !column_name %in% c("text1", "text2", "text3"))
Actual Instructions
dplyr::filter(df, !civic_city_name %in% c("DARTMOUTH", "COLE HARBOUR", "PORTERS LAKE"))
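In R, `!` has lower precedence than `%in%`, so `!x %in% y` is read as `!(x %in% y)`. One consequence worth knowing is that `NA %in% y` is `FALSE`, which means rows with a missing value are kept by the negated filter. The sketch below shows this on a made-up column; the `excluded` vector name is an illustrative choice, not something from the sample data.

```r
# Hypothetical stand-in data with one missing city; values are invented.
df <- data.frame(
  civic_city_name = c("DARTMOUTH", "HALIFAX", NA, "PORTERS LAKE", "BEDFORD"),
  stringsAsFactors = FALSE
)

# Illustrative helper vector holding the values to exclude.
excluded <- c("DARTMOUTH", "COLE HARBOUR", "PORTERS LAKE")

# Parsed as !(civic_city_name %in% excluded); the NA row is kept.
kept <- dplyr::filter(df, !civic_city_name %in% excluded)
kept$civic_city_name
```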
6.3 Combining Commands
6.3.1 Select specific columns and filter column by character value
Description | |
---|---|
Method to combine a select and filter expression in a single command |
Ingredients | |
---|---|
Package | Data |
readr | sample.csv |
Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
data %>%
  package::function(column_name1 %in% c("text1", "text2", "text3")) %>%
  package::function(column_name1, column_name2, column_name3)
Actual Instructions
df %>%
  dplyr::filter(civic_city_name %in% c("DARTMOUTH", "COLE HARBOUR", "PORTERS LAKE")) %>%
  dplyr::select(assessment_account_number, sale_price, sale_date)
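The `%>%` operator is not part of base R, so a package that provides it, such as dplyr or magrittr, has to be attached before a pipeline will run. The sketch below attaches dplyr and runs a filter-then-select pipeline on a made-up data frame. The order of the steps matters here: the filter uses `civic_city_name`, which the select step then drops, so the filter has to come first.

```r
library(dplyr)  # attaches %>% along with filter() and select()

# Hypothetical stand-in data; values are invented.
df <- data.frame(
  assessment_account_number = c(101, 102, 103),
  civic_city_name = c("DARTMOUTH", "HALIFAX", "COLE HARBOUR"),
  sale_price = c(45000, 75000, 300000),
  stringsAsFactors = FALSE
)

# Filter rows first, then keep only the columns of interest.
result <- df %>%
  filter(civic_city_name %in% c("DARTMOUTH", "COLE HARBOUR")) %>%
  select(assessment_account_number, sale_price)

result
```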