Data Analytics with R: A Recipe Book
Chapter 1 Intro
In 2018 I made my first New Year’s resolution in over 30 years: to use R exclusively at work. The challenge with this resolution was that I did not know how to write code, let alone write code in R. At that time I was leading a geospatial analytics team at the City of Toronto, and I decided to undertake all my analysis work, both geospatial and non-geospatial, with a code-first approach using R. It was a very difficult journey for me, one with many ups and downs.
Along the way I was helped and encouraged by many people: my best friend Dr. Matt Adams, with whom I spent a long Friday night at his dining room table going page by page through the R for Data Science book by Hadley Wickham & Garrett Grolemund; Sharla Gelfand and Geoffrey Hunter, who provided tremendous encouragement; and, of course, the amazing R community. Alongside the positive experiences I also encountered countless stumbling blocks: learning material designed for people who already have some literacy in the subject; a feeling that many authors forget what it was like to be at the start of the learning journey; material that was lengthy and difficult to consume; and formal learning structured as a linear path, whereas my journey as an adult learner in a working environment was non-linear and driven by the challenge of the day. The combination of these stumbling blocks and my lack of formal education in quantitative analysis or computer programming developed into a serious inferiority complex.
However, one evening in the midst of the COVID-19 lockdown, while watching a YouTube video by Johnny Harris titled “The Fastest Way to Learn a New Language: The Video Game Map Theory”, I was inspired to turn the experiences, both successes and challenges, gained over my four-year R journey into a book designed to help those who are on their own R data analytics journey.
Just what the world needs: another data analytics book focusing on R. That is where this book, I hope, is a little different. Its structure and design are based on iterative learning, starting with the most basic concept and building by adding one new element at a time. Throughout my life I have found highly technical subjects challenging to learn; however, I have been able to overcome the challenge by trying to make a 1-to-1 comparison based on a four-part idea:
- concept (what)
- purpose (why)
- structure (how)
- example (how)
This book builds on a central theme of the Johnny Harris video: a linear learning style aimed solely at fluency in a language may leave learners unable to perform the basic, tangible actions they actually want to carry out in that language. With that in mind, and thinking of adult learners and people learning on the job who are looking for help with a specific challenge, the book is structured as small, easily consumable chunks, similar to recipe cards. A recipe card is self-contained, providing all the ingredients, preparation, and instructions required to create a meal. While a cookbook may consist of many recipes, there is no expectation to read, understand, and master all of them in order to prepare a meal. With this as its central theme, the book has been designed as a collection of data analytics recipes focusing on the R language.
1.1 How to Use the Book
The book has a natural progression built around a five-course meal, starting at the very beginning for individuals who have no prior experience coding in R but are looking to join the R data analytics community. To take advantage of the book you should have access to R and RStudio (either locally installed or available via a cloud provider) and be somewhat comfortable within the RStudio software environment. To follow along with the recipes I recommend working in either an R Notebook or an R Markdown document; both can be created within RStudio by clicking File > New File and selecting either the R Notebook or R Markdown… option.
In R there are many ways to tackle a problem, each with its own pros and cons. The examples, or recipes, in this book are by no means the only, best, or most efficient means of performing a given data analytics task. The approach taken is to build a base that can be easily extended using a similar style, thought process, and coding syntax.
The sections are independent of each other, allowing a learner to begin according to their individual R data analytics journey. The book is broken into five sections:
An example design of an R data analytics recipe card:
Select data by multiple column names
Method to subset data using multiple column names
package::function(dataframe, column_name1, column_name2, column_name3)
dplyr::select(countries, continent, capital, population)
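To try the recipe card above, here is a minimal sketch. It assumes the dplyr package is installed. The `countries` data frame is hypothetical and built inline purely for illustration, since the card does not say where that data comes from, and its population values are rough placeholders rather than authoritative figures.

```r
library(dplyr)

# Hypothetical data: a tiny `countries` data frame invented for this sketch.
# The population values are rounded placeholders, not official statistics.
countries <- data.frame(
  country    = c("Canada", "Japan", "Kenya"),
  continent  = c("North America", "Asia", "Africa"),
  capital    = c("Ottawa", "Tokyo", "Nairobi"),
  population = c(38000000, 125000000, 54000000),
  stringsAsFactors = FALSE
)

# Keep only the three named columns; every row is retained.
selected <- dplyr::select(countries, continent, capital, population)

# Show which columns remain after the selection.
names(selected)
```

The `country` column is dropped because it is not named in the call, and the remaining columns appear in the order they were listed.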
I love spicy food: tacos, burritos, chili, curry, hot sauce, almost anything, and I add hot sauce to almost everything. When you begin eating spicy food you generally do not start with a Carolina Reaper, ghost pepper, or habanero; most people start by adding Frank’s Hot Sauce, a little chili powder, or some fresh jalapenos to a meal. In the same way, the book uses a spicy index to communicate the relative difficulty of the analytical concept or coding technique in each recipe.
Not spicy: Not technically difficult subject matter
Mild spicy: Minimal technically difficult subject matter
Moderate spicy: Moderate technically difficult subject matter
Very spicy: Technically difficult subject matter
1.2 Data Used within the Book
dataZONE property and municipal datasets from across Nova Scotia
Four datasets related to Nova Scotia properties will be used throughout the book:
- residential dwelling characteristics
- assessed value and taxable assessed value history
- parcel land sizes
- parcel sales history
To reduce the size of the data, I have selected a subsample of each of the four datasets, which is available here.
The full datasets, along with others, can be accessed from the dataZONE data catalog.
Like most authors, and as stated earlier, I was inspired by many different sources during the ideation, design, and creation of this book. The following are selected resources that provided inspiration for it.
R for Data Science by Hadley Wickham & Garrett Grolemund
Text Mining with R by Julia Silge & David Robinson
Geocomputation with R by Robin Lovelace, Jakub Nowosad & Jannes Muenchow
Cookshelf Chinese by Jenny Stacey
The Fastest Way to Learn a New Language: The Video Game Map Theory by Johnny Harris
1.4 R Journey Recommendations
Everyone tackles problems and learns differently. Reflecting on my R journey I would recommend the following to anyone beginning their journey:
- start small
- choose a real-world problem that you are interested in (e.g., from school, work, or a personal interest)
- break the problem into chunks that allow you to see progress
- in the beginning, prioritize making it work over writing highly optimized and efficient code
- build off the knowledge you develop
- don’t be afraid of making mistakes
- ask for help
- celebrate your successes, regardless of how small they may seem
- most of all, HAVE FUN!!!