Chapter 6 Data Exploration

Data exploration is an essential part of data analytics and a preliminary step to the data wrangling techniques covered in the next three chapters. As in other areas of R, there are many different approaches to data exploration. This chapter covers data exploration techniques that do not rely on data visualization.

You can have data without information, but you cannot have information without data
Daniel Keys Moran

6.1 Selecting Data Columns

The size of data continues to grow, both in the number of rows and the number of columns in a data frame. Wide data, meaning data with many columns, can make exploration difficult, so it is important to be able to limit and specify the columns used during data exploration.

6.1.1 Select data by column name

Description
Method to subset data using a specific column name
Ingredients
Package Data

readr
dplyr

sample.csv


Preparation

df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")


Sample Instructions

package::function(data, column_name)


Actual Instructions

dplyr::select(df, assessment_account_number)

6.1.2 Select data by multiple column names

Description
Method to subset data using multiple column names
Ingredients
Package Data

readr
dplyr

sample.csv


Preparation

df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")


Sample Instructions

package::function(data, column_name1, column_name2, column_name3, column_name4)


Actual Instructions

dplyr::select(df, assessment_account_number, civic_city_name, sale_price, sale_date)

6.1.3 Select data by column index

Description
Method to subset data using a specific column index number
Ingredients
Package Data

readr
dplyr

sample.csv


Preparation

df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")


Sample Instructions

package::function(data, number)


Actual Instructions

dplyr::select(df, 2)
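
Column index numbers depend on the order of the columns in the file, so it can help to confirm which name sits at which position before selecting by index. The following is a minimal sketch, assuming a small made-up data frame called toy (not the sample file) that borrows a few column names from this chapter. The values are illustrative only.

```r
# Made-up stand-in for df; values are illustrative only
toy <- data.frame(
  assessment_account_number = c(101, 102),
  civic_city_name = c("DARTMOUTH", "HALIFAX"),
  sale_price = c(50000, 75000)
)

# List every column name in positional order
names(toy)

# Look up the position of one column by its name
idx <- which(names(toy) == "civic_city_name")

# Selecting by that position returns the same column as selecting by name
by_index <- dplyr::select(toy, dplyr::all_of(idx))
by_name <- dplyr::select(toy, civic_city_name)
all.equal(by_index, by_name)
```

Wrapping the stored position in dplyr::all_of() tells select that idx is an external value rather than a column named idx.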

6.1.4 Select data by multiple column indexes

Description
Method to subset data using multiple column index numbers
Ingredients
Package Data

readr
dplyr

sample.csv


Preparation

df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")


Sample Instructions

package::function(data, number, number, number, number)


Actual Instructions

dplyr::select(df, 2, 8, 9, 10)

6.1.5 Select data by column index range

Description
Method to subset data using a range of column index numbers
Ingredients
Package Data

readr
dplyr

sample.csv


Preparation

df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")


Sample Instructions

package::function(data, number:number)


Actual Instructions

dplyr::select(df, 1:4)
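
A range can also be written with column names instead of positions, which may be easier to read when the column order is known. This is a small sketch on a made-up five-column data frame called toy. The column names a to e are hypothetical and do not come from the sample file.

```r
# Made-up data frame with hypothetical column names a to e
toy <- data.frame(a = 1:2, b = 3:4, c = 5:6, d = 7:8, e = 9:10)

# Range by position: columns 1 through 4
by_position <- dplyr::select(toy, 1:4)

# Range by name: every column from a through d, in file order
by_name_range <- dplyr::select(toy, a:d)

names(by_position)
names(by_name_range)
```

Both forms depend on the order of columns in the data frame, so either can return different columns if the file layout changes.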

6.1.6 Select data by column index and column index range

Description
Method to subset data by specific column index numbers and a range of column index numbers
Ingredients
Package Data

readr
dplyr

sample.csv


Preparation

df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")


Sample Instructions

package::function(data, number, number, number, number:number)


Actual Instructions

dplyr::select(df, 2, 6, 11, 8:10)
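
The order in which indexes and ranges are listed is also the order of the columns in the result, so a select like this can reorder columns as well as subset them. This is a minimal sketch on a made-up data frame called toy with hypothetical column names a to e.

```r
# Made-up data frame with hypothetical column names a to e
toy <- data.frame(a = 1:2, b = 3:4, c = 5:6, d = 7:8, e = 9:10)

# A single index followed by a range: columns come back in the order requested
reordered <- dplyr::select(toy, 5, 1:3)
names(reordered)
```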

6.1.7 Select all but last column

Description
Method to retrieve all columns except the last column
Ingredients
Package Data

readr
dplyr

sample.csv


Preparation

df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")


Sample Instructions

package::function(data, -package::function())


Actual Instructions

dplyr::select(df, -dplyr::last_col())
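
A negative index drops the column in that position, so -1 removes the first column, while -dplyr::last_col() removes whichever column is last. This is a minimal sketch on a made-up three-column data frame called toy, with hypothetical column names, to check which column each form removes.

```r
# Made-up data frame with hypothetical column names a, b and c
toy <- data.frame(a = 1:2, b = 3:4, c = 5:6)

# -1 drops the column in position 1, that is, the first column
drop_first <- names(dplyr::select(toy, -1))
drop_first

# -dplyr::last_col() drops the final column, however many columns there are
drop_last <- names(dplyr::select(toy, -dplyr::last_col()))
drop_last
```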

6.2 Filtering Data

As with reducing the number of columns, it is important during data exploration to be able to reduce the number of rows. This is done by creating a subset of the data based on one or more values.
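
Before filtering, it can be useful to see which values a column contains and how many rows each value accounts for. This is a minimal sketch, assuming a made-up data frame called toy that reuses two column names from this chapter with illustrative values.

```r
# Made-up stand-in for df; values are illustrative only
toy <- data.frame(
  civic_city_name = c("DARTMOUTH", "HALIFAX", "DARTMOUTH", "COLE HARBOUR"),
  sale_price = c(50000, 75000, 100000, 260000)
)

# Number of rows for each distinct city name
city_counts <- dplyr::count(toy, civic_city_name)
city_counts

# Row counts before and after filtering on one of those values
rows_before <- nrow(toy)
rows_after <- nrow(dplyr::filter(toy, civic_city_name == "DARTMOUTH"))
```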

6.2.1 Filter single column by character value

Description
Method to filter a single column of character data when equal to a specific value
Ingredients
Package Data

readr
dplyr

sample.csv


Preparation

df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")


Sample Instructions

package::function(data, column_name == "text")


Actual Instructions

dplyr::filter(df, civic_city_name == "DARTMOUTH")
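
Comparison with == is an exact, case-sensitive match, so rows that spell the same value with different capitalization are not returned. This is a minimal sketch on a made-up data frame called toy with illustrative values. Converting the column with toupper() is one possible way to match regardless of case.

```r
# Made-up stand-in for df; values are illustrative only
toy <- data.frame(
  civic_city_name = c("DARTMOUTH", "Dartmouth", "HALIFAX"),
  sale_price = c(50000, 75000, 100000)
)

# Exact match: only the row spelled in upper case is kept
exact <- dplyr::filter(toy, civic_city_name == "DARTMOUTH")

# Case-insensitive match: convert the column to upper case before comparing
any_case <- dplyr::filter(toy, toupper(civic_city_name) == "DARTMOUTH")

nrow(exact)
nrow(any_case)
```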

6.2.2 Filter single column by numeric value

Description
Method to filter a single column of numeric data when equal to a specific value
Ingredients
Package Data

readr
dplyr

sample.csv


Preparation

df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")


Sample Instructions

package::function(data, column_name == number)


Actual Instructions

dplyr::filter(df, parcels_in_sale == 1)

6.2.3 Filter single column by multiple character values

Description
Method to filter a single column of character data with multiple character values
Ingredients
Package Data

readr
dplyr

sample.csv


Preparation

df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")


Sample Instructions

package::function(data, column_name %in% c("text", "text", "text"))


Actual Instructions

dplyr::filter(df, civic_city_name %in% c("DARTMOUTH", "COLE HARBOUR", "PORTERS LAKE"))

6.2.4 Filter single column by multiple numeric values

Description
Method to filter a single column of numeric data with multiple numeric values
Ingredients
Package Data

readr
dplyr

sample.csv


Preparation

df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")


Sample Instructions

package::function(data, column_name %in% c(number, number, number))


Actual Instructions

dplyr::filter(df, parcels_in_sale %in% c(2, 3, 5))

6.2.5 Filter single column by range of numeric values

Description
Method to filter numeric data within a range of values
Ingredients
Package Data

readr
dplyr

sample.csv


Preparation

df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")


Sample Instructions

package::function(data, package::function(column, low_value, high_value))


Actual Instructions

dplyr::filter(df, dplyr::between(sale_price, 50000, 100000))
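
dplyr::between() includes both end points, so it behaves like a pair of >= and <= comparisons. This is a minimal sketch on a made-up column of prices chosen to sit just outside, on, and inside the limits.

```r
# Made-up prices around the two limits; values are illustrative only
toy <- data.frame(sale_price = c(49999, 50000, 75000, 100000, 100001))

# Range filter with between(): both limits are included
with_between <- dplyr::filter(toy, dplyr::between(sale_price, 50000, 100000))

# The same filter written with comparison operators
with_operators <- dplyr::filter(toy, sale_price >= 50000, sale_price <= 100000)

with_between$sale_price
with_operators$sale_price
```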

6.2.6 Filter numeric data larger than

Description
Method to filter numeric data that is larger than a value
Ingredients
Package Data

readr
dplyr

sample.csv


Preparation

df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")


Sample Instructions

package::function(data, column > number)


Actual Instructions

dplyr::filter(df, sale_price > 250000)

6.2.7 Filter numeric data smaller than

Description
Method to filter numeric data that is smaller than a value
Ingredients
Package Data

readr
dplyr

sample.csv


Preparation

df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")


Sample Instructions

package::function(data, column < number)


Actual Instructions

dplyr::filter(df, sale_price < 50000)
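
A comparison against a missing value returns NA, and dplyr::filter() keeps only rows where the condition is TRUE, so rows with a missing value are dropped by both the larger-than and smaller-than filters. This is a minimal sketch on a made-up column called sale_price containing one NA. Whether the sample file has missing prices is not known here.

```r
# Made-up prices with one missing value; values are illustrative only
toy <- data.frame(sale_price = c(30000, NA, 300000))

below <- dplyr::filter(toy, sale_price < 50000)
above <- dplyr::filter(toy, sale_price > 250000)

# The row with the missing price appears in neither result
nrow(below) + nrow(above)

# Keep missing prices explicitly when they matter
below_or_missing <- dplyr::filter(toy, sale_price < 50000 | is.na(sale_price))
nrow(below_or_missing)
```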

6.2.8 Filter multiple columns

Description
Method to filter data using multiple columns, keeping only rows where the conditions on all columns are true
Ingredients
Package Data

readr
dplyr

sample.csv


Preparation

df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")


Sample Instructions

package::function(data, column1 == "text", column2 == number)


Actual Instructions

dplyr::filter(df, civic_city_name == "DARTMOUTH", parcels_in_sale == 2)
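
Conditions separated by commas are combined with a logical AND, which is the same as joining them with &. To keep rows that satisfy either condition, the | operator can be used instead. This is a minimal sketch on a made-up data frame called toy with illustrative values.

```r
# Made-up stand-in for df; values are illustrative only
toy <- data.frame(
  civic_city_name = c("DARTMOUTH", "DARTMOUTH", "HALIFAX"),
  parcels_in_sale = c(2, 1, 2)
)

# Comma-separated conditions: both must be true
both_comma <- dplyr::filter(toy, civic_city_name == "DARTMOUTH", parcels_in_sale == 2)

# Equivalent form using &
both_amp <- dplyr::filter(toy, civic_city_name == "DARTMOUTH" & parcels_in_sale == 2)

# Either condition may be true
either <- dplyr::filter(toy, civic_city_name == "DARTMOUTH" | parcels_in_sale == 2)

nrow(both_comma)
nrow(both_amp)
nrow(either)
```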

6.2.9 Filter when multiple character values not in a column

Description
Method to filter data when specific text values are not present in a specified column
Ingredients
Package Data

readr
dplyr

sample.csv


Preparation

df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")


Sample Instructions

package::function(data, !column_name %in% c("text1", "text2", "text3"))


Actual Instructions

dplyr::filter(df, !civic_city_name %in% c("DARTMOUTH", "COLE HARBOUR", "PORTERS LAKE"))
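
In R the %in% operator is evaluated before !, so !column %in% values means "not (column in values)". Adding parentheses makes that reading explicit without changing the result. This is a minimal sketch on a made-up data frame called toy. The helper vector excluded is a hypothetical name introduced for illustration.

```r
# Made-up stand-in for df; values are illustrative only
toy <- data.frame(
  civic_city_name = c("DARTMOUTH", "HALIFAX", "PORTERS LAKE", "BEDFORD")
)

# Hypothetical helper vector holding the values to exclude
excluded <- c("DARTMOUTH", "COLE HARBOUR", "PORTERS LAKE")

# Negation without parentheses, as written in the recipe
implicit <- dplyr::filter(toy, !civic_city_name %in% excluded)

# Negation with explicit parentheses
explicit <- dplyr::filter(toy, !(civic_city_name %in% excluded))

implicit$civic_city_name
explicit$civic_city_name
```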

6.3 Combining Commands

6.3.1 Select specific columns and filter column by character value

Description
Method to combine a select and filter expression in a single command
Ingredients
Package Data

readr
dplyr

sample.csv


Preparation

library(dplyr)  # attaches the %>% pipe used below
df <- readr::read_csv("C:/data/sample_parcel_sales_history.csv")


Sample Instructions

data %>%
  package::function(column_name1 %in% c("text1", "text2", "text3")) %>%
  package::function(column_name1, column_name2, column_name3)


Actual Instructions

df %>%
  dplyr::filter(civic_city_name %in% c("DARTMOUTH", "COLE HARBOUR", "PORTERS LAKE")) %>%
  dplyr::select(assessment_account_number, sale_price, sale_date)
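
The order of the piped steps matters. Filtering first means the filter column is still available, even though it is not part of the final selection. Selecting first would remove it before the filter runs. This is a minimal sketch on a made-up data frame called toy with illustrative values. It also shows that the piped form is equivalent to nesting the two calls.

```r
library(dplyr)  # attaches the %>% pipe

# Made-up stand-in for df; values are illustrative only
toy <- data.frame(
  assessment_account_number = c(101, 102, 103),
  civic_city_name = c("DARTMOUTH", "HALIFAX", "COLE HARBOUR"),
  sale_price = c(50000, 75000, 100000)
)

# Filter first: civic_city_name is still present, then dropped by select
filtered_then_selected <- toy %>%
  dplyr::filter(civic_city_name %in% c("DARTMOUTH", "COLE HARBOUR")) %>%
  dplyr::select(assessment_account_number, sale_price)

# The piped form is equivalent to nesting the calls
nested <- dplyr::select(
  dplyr::filter(toy, civic_city_name %in% c("DARTMOUTH", "COLE HARBOUR")),
  assessment_account_number, sale_price
)

# Select first: the filter column is gone, so the filter step raises an error
select_first <- try(
  toy %>%
    dplyr::select(assessment_account_number, sale_price) %>%
    dplyr::filter(civic_city_name %in% c("DARTMOUTH")),
  silent = TRUE
)
inherits(select_first, "try-error")
```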