Chapter 6 Data Exploration
Data exploration is an essential part of data analytics and the preliminary step to the data wrangling techniques covered in the next three chapters. As in other areas of R, there are many approaches to data exploration. This chapter covers exploration techniques that do not rely on data visualization.
You can have data without information, but you cannot have information without data
Daniel Keys Moran
6.1 Selecting Data Columns
Datasets continue to grow in both the number of rows and the number of columns. Wide data, meaning data with many columns, can make exploration difficult, so it is important to be able to limit and specify the columns used during data exploration.
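Several of the recipes below select columns by position, so it can help to check column names and their positions first. The following is a minimal sketch, assuming a small made-up tibble named `parcels` that borrows a few column names from the sample file; the name `parcels` and all values are illustrative assumptions, not part of the sample data.

```r
# Hypothetical stand-in for the sample data; values are made up for illustration.
parcels <- tibble::tibble(
  assessment_account_number = c(1001, 1002, 1003),
  civic_city_name = c("DARTMOUTH", "HALIFAX", "COLE HARBOUR"),
  sale_price = c(75000, 310000, 42000),
  parcels_in_sale = c(1, 2, 1)
)

# Column names in order; the position of each name is its column index.
names(parcels)

# Number of columns, i.e. the largest valid column index.
ncol(parcels)

# Compact overview of each column's type and first values.
dplyr::glimpse(parcels)

# Selecting by name and by the matching index should return the same column.
by_name <- dplyr::select(parcels, sale_price)
by_index <- dplyr::select(parcels, 3)
```

Selecting by name is usually more robust than selecting by index, because an index silently points at a different column if the column order of the file changes.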
6.1.1 Select data by column name
| Description |
|---|
| Method to subset data using a specific column name |

| Ingredients | |
|---|---|
| Package | Data |
| readr, dplyr | sample_parcel_sales_history.csv |

Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, column_name)
Actual Instructions
dplyr::select(df, assessment_account_number)
6.1.2 Select data by multiple column names
| Description |
|---|
| Method to subset data using multiple column names |

| Ingredients | |
|---|---|
| Package | Data |
| readr, dplyr | sample_parcel_sales_history.csv |

Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, column_name1, column_name2, column_name3, column_name4)
Actual Instructions
dplyr::select(df, assessment_account_number, civic_city_name, sale_price, sale_date)
6.1.3 Select data by column index
| Description |
|---|
| Method to subset data using a specific column index number |

| Ingredients | |
|---|---|
| Package | Data |
| readr, dplyr | sample_parcel_sales_history.csv |

Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, number)
Actual Instructions
dplyr::select(df, 2)
6.1.4 Select data by multiple column index
| Description |
|---|
| Method to subset data using multiple column index numbers |

| Ingredients | |
|---|---|
| Package | Data |
| readr, dplyr | sample_parcel_sales_history.csv |

Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, number, number, number, number)
Actual Instructions
dplyr::select(df, 2, 8, 9, 10)
6.1.5 Select data by column index range
| Description |
|---|
| Method to subset data using a range of column index numbers |

| Ingredients | |
|---|---|
| Package | Data |
| readr, dplyr | sample_parcel_sales_history.csv |

Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, number:number)
Actual Instructions
dplyr::select(df, 1:4)
6.1.6 Select data by column index and column index range
| Description |
|---|
| Method to subset data by specific column index numbers and a range of column index numbers |

| Ingredients | |
|---|---|
| Package | Data |
| readr, dplyr | sample_parcel_sales_history.csv |

Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, number, number, number, number:number)
Actual Instructions
dplyr::select(df, 2, 6, 11, 8:10)
6.1.7 Select all but last column
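| Description |
|---|
| Method to retrieve all columns except the last column |

| Ingredients | |
|---|---|
| Package | Data |
| readr, dplyr | sample_parcel_sales_history.csv |

Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, -package::function())
Actual Instructions
# dplyr::last_col() refers to the last column; the minus sign drops it.
# A plain -1 would drop the first column instead.
dplyr::select(df, -dplyr::last_col())
6.2 Filtering Data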
Just as it helps to reduce the number of columns during data exploration, it is important to be able to reduce the number of rows. This is done by creating a subset of the data based on one or more values.
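Before the individual recipes, a short sketch of how row filtering behaves may be useful. It assumes a small made-up tibble named `parcels` with column names borrowed from the sample file; the name `parcels` and all values are illustrative assumptions. It shows that conditions separated by commas must all hold, while `|` keeps a row when either condition holds.

```r
# Hypothetical stand-in for the sample data; values are made up for illustration.
parcels <- tibble::tibble(
  civic_city_name = c("DARTMOUTH", "HALIFAX", "DARTMOUTH", "COLE HARBOUR"),
  sale_price = c(75000, 310000, 260000, 42000),
  parcels_in_sale = c(1, 2, 2, 1)
)

# Rows before filtering.
nrow(parcels)

# Conditions separated by commas must all be TRUE (logical AND).
both <- dplyr::filter(parcels, civic_city_name == "DARTMOUTH", sale_price > 250000)
nrow(both)

# Use | when a row should be kept if either condition is TRUE (logical OR).
either <- dplyr::filter(parcels, civic_city_name == "DARTMOUTH" | sale_price > 250000)
nrow(either)
```

Note that `dplyr::filter()` keeps only rows where the condition evaluates to `TRUE`, so rows where the condition is `NA` are dropped as well.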
6.2.1 Filter single column by character value
| Description |
|---|
| Method to filter a single column of character data when equal to a specific value |

| Ingredients | |
|---|---|
| Package | Data |
| readr, dplyr | sample_parcel_sales_history.csv |

Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, column_name == "text")
Actual Instructions
dplyr::filter(df, civic_city_name == "DARTMOUTH")
6.2.2 Filter single column by numeric value
| Description |
|---|
| Method to filter a single column of numeric data when equal to a specific value |

| Ingredients | |
|---|---|
| Package | Data |
| readr, dplyr | sample_parcel_sales_history.csv |

Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, column_name == number)
Actual Instructions
dplyr::filter(df, parcels_in_sale == 1)
6.2.3 Filter single column by multiple character values
| Description |
|---|
| Method to filter a single column of character data with multiple character values |

| Ingredients | |
|---|---|
| Package | Data |
| readr, dplyr | sample_parcel_sales_history.csv |

Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, column_name %in% c("text", "text", "text"))
Actual Instructions
dplyr::filter(df, civic_city_name %in% c("DARTMOUTH", "COLE HARBOUR", "PORTERS LAKE"))
6.2.4 Filter single column by multiple numeric values
| Description |
|---|
| Method to filter a single column of numeric data with multiple numeric values |

| Ingredients | |
|---|---|
| Package | Data |
| readr, dplyr | sample_parcel_sales_history.csv |

Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, column_name %in% c(number, number, number))
Actual Instructions
dplyr::filter(df, parcels_in_sale %in% c(2, 3, 5))
6.2.5 Filter single column by range of numeric values
| Description |
|---|
| Method to filter numeric data within a range of values (both bounds inclusive) |

| Ingredients | |
|---|---|
| Package | Data |
| readr, dplyr | sample_parcel_sales_history.csv |

Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, package::function(column, low_value, high_value))
Actual Instructions
dplyr::filter(df, dplyr::between(sale_price, 50000, 100000))
6.2.6 Filter numeric data larger than
| Description |
|---|
| Method to filter numeric data that is larger than a value |

| Ingredients | |
|---|---|
| Package | Data |
| readr, dplyr | sample_parcel_sales_history.csv |

Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, column > number)
Actual Instructions
dplyr::filter(df, sale_price > 250000)
6.2.7 Filter numeric data smaller than
| Description |
|---|
| Method to filter numeric data that is smaller than a value |

| Ingredients | |
|---|---|
| Package | Data |
| readr, dplyr | sample_parcel_sales_history.csv |

Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, column < number)
Actual Instructions
dplyr::filter(df, sale_price < 50000)
6.2.8 Filter multiple columns
| Description |
|---|
| Method to filter data on multiple columns, keeping only rows where every condition is true |

| Ingredients | |
|---|---|
| Package | Data |
| readr, dplyr | sample_parcel_sales_history.csv |

Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, column1 == "text", column2 == number)
Actual Instructions
dplyr::filter(df, civic_city_name == "DARTMOUTH", parcels_in_sale == 2)
6.2.9 Filter when multiple character values not in a column
| Description |
|---|
| Method to filter data when specific text values are not present in a specified column |

| Ingredients | |
|---|---|
| Package | Data |
| readr, dplyr | sample_parcel_sales_history.csv |

Preparation
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
package::function(data, !column_name %in% c("text1", "text2", "text3"))
Actual Instructions
dplyr::filter(df, !civic_city_name %in% c("DARTMOUTH", "COLE HARBOUR", "PORTERS LAKE"))
6.3 Combining Commands
6.3.1 Select specific columns and filter column by character value
| Description |
|---|
| Method to combine a select and a filter expression in a single command |

| Ingredients | |
|---|---|
| Package | Data |
| readr, dplyr | sample_parcel_sales_history.csv |

Preparation
library(dplyr)  # attaches the %>% pipe used below
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")
Sample Instructions
data %>%
  package::function(column_name1 %in% c("text1", "text2", "text3")) %>%
  package::function(column_name1, column_name2, column_name3)
Actual Instructions
df %>%
  dplyr::filter(civic_city_name %in% c("DARTMOUTH", "COLE HARBOUR", "PORTERS LAKE")) %>%
  dplyr::select(assessment_account_number, sale_price, sale_date)
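To make the role of the pipe concrete, the sketch below compares a piped command with its nested equivalent. It assumes a small made-up tibble named `parcels` with column names borrowed from the sample file; the name `parcels` and all values are illustrative assumptions.

```r
library(dplyr)  # attaches the %>% pipe along with filter() and select()

# Hypothetical stand-in for the sample data; values are made up for illustration.
parcels <- tibble::tibble(
  assessment_account_number = c(1001, 1002, 1003),
  civic_city_name = c("DARTMOUTH", "HALIFAX", "COLE HARBOUR"),
  sale_price = c(75000, 310000, 42000)
)

# Piped form: filter rows first, then keep two columns.
piped <- parcels %>%
  dplyr::filter(civic_city_name %in% c("DARTMOUTH", "COLE HARBOUR")) %>%
  dplyr::select(assessment_account_number, sale_price)

# Nested form: the pipe passes its left-hand side as the first argument,
# so this call does the same work in a single nested expression.
nested <- dplyr::select(
  dplyr::filter(parcels, civic_city_name %in% c("DARTMOUTH", "COLE HARBOUR")),
  assessment_account_number, sale_price
)
```

The order of the steps matters here. The filter runs first because `civic_city_name` is not among the selected columns; selecting first would drop that column before `filter()` could use it, and the command would stop with an error.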