How to Analyze Data with R: A Complete Beginner Guide to dplyr (2024)

[This article was first published on r – Appsilon | End to End Data Science Solutions, and kindly contributed to R-bloggers.]

Datasets often take many hours of work to understand fully. R makes this process easier through the dplyr package, one of the most approachable tools for code-based data analysis. You’ll learn how to use it today.

Are you completely new to R? Here’s our beginner R guide for programmers.

You’ll use the Gapminder dataset throughout the article. It’s available through CRAN, so make sure to install it. Here’s how to load in all required packages:
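A minimal sketch of the setup, assuming the required packages are dplyr and gapminder (both are on CRAN; the original snippet may have loaded others as well):

```r
# Install once if needed:
# install.packages(c("dplyr", "gapminder"))

library(dplyr)      # data manipulation verbs and the %>% pipe
library(gapminder)  # provides the `gapminder` data frame

# Quick check of the available columns
names(gapminder)
```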

Here’s what the first couple of rows of the Gapminder dataset look like:
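One way to get such a preview, assuming the data frame is available as gapminder from the gapminder package, is base R's head(), which returns the first six rows by default. The name first_rows is only illustrative:

```r
library(gapminder)

# head() returns the first six rows unless told otherwise
first_rows <- head(gapminder)
first_rows
```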

Image 1 – Gapminder dataset head

And that’s all you need to start analyzing.

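Today you’ll learn about:

  • Column Selection
  • Data Filtering
  • Data Ordering
  • Creating Derived Columns
  • Calculating Summary Statistics
  • Grouping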

Column Selection

More often than not, you don’t need all dataset columns for your analysis. R’s dplyr provides a couple of ways to select columns of interest. The first one is more obvious – you pass the column names inside the select() function.

Here’s how to use this syntax to select a couple of columns:
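A minimal sketch of this syntax, assuming the gapminder data frame. The three columns chosen here are illustrative and may differ from the original snippet:

```r
library(dplyr)
library(gapminder)

# Keep only the named columns, in the order given
selected <- gapminder %>%
  select(country, year, lifeExp)

head(selected)
```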

Here are the results:

Image 2 – Column selection method 1

But what if you have dozens of columns and want to select all but a few? There’s a better way – specify the columns you don’t need with a minus sign (-) as a prefix:
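As a sketch of the minus-sign approach, again assuming the gapminder data frame, the following drops the continent column and keeps everything else. The name no_continent is only illustrative:

```r
library(dplyr)
library(gapminder)

# A minus sign in front of a column name excludes that column
no_continent <- gapminder %>%
  select(-continent)

head(no_continent)
```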

Here are the results:

Image 3 – Column selection method 2

As you can see, the continent column is the only one that isn’t shown. And that’s all you should know about column selection. Let’s proceed with data filtering.

Data Filtering

Filtering datasets is one of the most common operations you’ll do on the job. Not all data is relevant at a given time. Sometimes you need values for a particular product or its sales figures in Q1. Or both. That’s where the filter() function comes in handy.

Here’s how to display results only for 2007:
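A sketch of that filter, assuming the gapminder data frame and its year column. Note the double equals sign for comparison:

```r
library(dplyr)
library(gapminder)

# Keep only the rows where year equals 2007
data_2007 <- gapminder %>%
  filter(year == 2007)

head(data_2007)
```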

The results are shown below:

Image 4 – Data filtering example – year = 2007

You can combine multiple conditions inside a single filter() call. Separate them with commas, and only the rows that match all of them are kept. Here’s how to select the record for Poland in 2007:
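A possible version of this query, assuming the gapminder data frame. The comma between the two conditions acts as a logical AND:

```r
library(dplyr)
library(gapminder)

# Both conditions must hold for a row to be kept
poland_2007 <- gapminder %>%
  filter(year == 2007, country == "Poland")

poland_2007
```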

Here are the results:

Image 5 – Data filtering example – year = 2007, country = Poland

But what if you want results for multiple countries? You can use the %in% operator for the task. The snippet below shows records for 2007 for Poland and Croatia:
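Sketched under the same assumptions (the gapminder data frame, with an illustrative variable name), %in% checks whether each country appears in the given vector:

```r
library(dplyr)
library(gapminder)

# %in% is TRUE for rows whose country is one of the listed values
two_countries <- gapminder %>%
  filter(year == 2007, country %in% c("Poland", "Croatia"))

two_countries
```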

Here are the results:

Image 6 – Data filtering example – year = 2007, country = (Poland, Croatia)

If you understand these examples, you understand data filtering. Let’s continue with data ordering.

Data Ordering

Sometimes you want your data ordered by the values of one or more columns. For example, you might want to sort users by age or students by score, either in ascending or descending order. You can easily implement this behavior with dplyr’s arrange() function.

Here’s how to arrange the results by life expectancy:
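A minimal sketch, assuming the gapminder data frame and its lifeExp column. By default arrange() sorts from lowest to highest:

```r
library(dplyr)
library(gapminder)

# Sort rows by life expectancy, lowest first
by_life_exp <- gapminder %>%
  arrange(lifeExp)

head(by_life_exp)
```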

The results are shown below:

Image 7 – Data ordering example 1

As you can see, the data is ordered by the lifeExp column in ascending order. Many cases call for descending order instead. Here’s how you can implement it:
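One way to do this, assuming the gapminder data frame, is to wrap the column in dplyr's desc() helper:

```r
library(dplyr)
library(gapminder)

# desc() reverses the sort order for the wrapped column
by_life_exp_desc <- gapminder %>%
  arrange(desc(lifeExp))

head(by_life_exp_desc)
```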

Here are the results:

Image 8 – Data ordering example 2

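Sometimes you want only a couple of rows returned. The top_n() function lets you specify how many rows to keep, based on the largest values of a given column. Note that in dplyr 1.0.0 and later, top_n() is superseded by slice_max() and slice_min(). Here’s an example: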
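A sketch of what such a query might look like, assuming the gapminder data frame. The restriction to 2007 and the choice of five rows are illustrative, not taken from the original snippet. Since top_n() does not sort its result, arrange() is added to order the rows:

```r
library(dplyr)
library(gapminder)

# Five rows with the highest life expectancy in 2007
top5 <- gapminder %>%
  filter(year == 2007) %>%
  top_n(5, lifeExp) %>%
  arrange(desc(lifeExp))

# Equivalent with the newer slice_max() (dplyr >= 1.0.0),
# which also returns the rows in descending order
top5_alt <- gapminder %>%
  filter(year == 2007) %>%
  slice_max(lifeExp, n = 5)

top5
```

Both functions keep tied rows by default, so the result can contain more rows than requested when values are equal.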

The results are shown in the following image:

Image 9 – Data ordering example 3

And that’s it for ordering. Next up – derived columns.

Creating Derived Columns

With dplyr, you can use the mutate() function to create new attributes. The new attribute name is put on the left side of the equal sign, and the contents on the right – just as if you were to declare a variable.

The example below calculates GDP as a product of population and GDP per capita and stores it in a dedicated column. Some other transformations are made along the way:
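A sketch of this calculation, assuming the gapminder data frame with its pop and gdpPercap columns. The filter and select steps stand in for the "other transformations" and are assumptions, as is the column name gdp:

```r
library(dplyr)
library(gapminder)

gdp_data <- gapminder %>%
  filter(year == 2007) %>%                # assumed: restrict to one year
  select(country, pop, gdpPercap) %>%     # assumed: keep only what is needed
  mutate(gdp = pop * gdpPercap)           # new column: total GDP

head(gdp_data)
```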

Here are the results:

Image 10 – Calculating GDP as (population * GDP per capita)

Instead of mutate(), you can also use transmute(). There’s one major difference – transmute() keeps only the columns you create in the call. Let’s use it in the example from above:
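The same GDP calculation with transmute() might look like this, assuming the gapminder data frame. The 2007 filter and the column name gdp are illustrative:

```r
library(dplyr)
library(gapminder)

gdp_only <- gapminder %>%
  filter(year == 2007) %>%
  transmute(gdp = pop * gdpPercap)  # every other column is dropped

head(gdp_only)
```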

The results are shown below:

Image 11 – Calculating GDP with transmute() – all other columns are dropped

You’ll use mutate() more often, but knowing additional functions can’t hurt.

Calculating Summary Statistics

Summary statistics don’t need any introduction. In many cases, you need to calculate a simple average of a column. Here’s how to calculate the average life expectancy across the entire dataset:
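A minimal sketch using dplyr's summarize(), assuming the gapminder data frame. The result column name avgLifeExp is a hypothetical choice:

```r
library(dplyr)
library(gapminder)

# Collapse the whole dataset into a single-row summary
overall_avg <- gapminder %>%
  summarize(avgLifeExp = mean(lifeExp))

overall_avg
```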

Here are the results:

Image 12 – Calculating average life expectancy of the entire dataset

As you would imagine, you can chain other functions to calculate summary statistics only on a subset. Here’s how to calculate the average life expectancy in 2007 in Europe:
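Chaining filter() before summarize() is one way to do this. The sketch assumes the gapminder data frame, and avgLifeExp is again a hypothetical column name:

```r
library(dplyr)
library(gapminder)

# Filter first, then summarize only the remaining rows
europe_2007_avg <- gapminder %>%
  filter(year == 2007, continent == "Europe") %>%
  summarize(avgLifeExp = mean(lifeExp))

europe_2007_avg
```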

The results are shown in the following image:

Image 13 – Calculating average life expectancy for Europe in 2007

You can do much more with summary statistics, but that requires some grouping knowledge. Let’s cover that next.

Grouping

Summary statistics become much more powerful when combined with grouping. For example, you can use the group_by() function to calculate the average life expectancy per continent. Here’s how:

https://gist.github.com/darioappsilon/8b815ad3be908158c9d8c191dfa22af3
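For readers who cannot open the gist, here is a minimal sketch of a grouped summary. It assumes the gapminder data frame; the restriction to 2007 and the column name avgLifeExp are assumptions rather than a copy of the linked code:

```r
library(dplyr)
library(gapminder)

# One summary row per continent
avg_by_continent <- gapminder %>%
  filter(year == 2007) %>%          # assumed: a single year
  group_by(continent) %>%
  summarize(avgLifeExp = mean(lifeExp))

avg_by_continent
```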

Here are the results:

Image 14 – Calculating average life expectancy per continent

You can also use the previously discussed ordering functions to arrange the dataset by average life expectancy. Here’s how to do so in a descending way:
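Sketched with the same assumptions (the gapminder data frame, 2007 only, and a hypothetical avgLifeExp column), the grouped summary can be piped straight into arrange():

```r
library(dplyr)
library(gapminder)

avg_by_continent_desc <- gapminder %>%
  filter(year == 2007) %>%                  # assumed: a single year
  group_by(continent) %>%
  summarize(avgLifeExp = mean(lifeExp)) %>%
  arrange(desc(avgLifeExp))                 # highest average first

avg_by_continent_desc
```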

The results are shown below:

Image 15 – Ordering dataset by average life expectancy per continent

One other powerful function is if_else(). You can use it when creating new columns whose value depends on some conditions.

For example, here’s how to create a column named over75, which has a value of Y if the average life expectancy for a continent is over 75, and N otherwise:
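A possible implementation, assuming the gapminder data frame. The 2007 filter and the intermediate column name avgLifeExp are assumptions. Note that if_else() expects both outcomes to have the same type, here character:

```r
library(dplyr)
library(gapminder)

continent_flags <- gapminder %>%
  filter(year == 2007) %>%                  # assumed: a single year
  group_by(continent) %>%
  summarize(avgLifeExp = mean(lifeExp)) %>%
  mutate(over75 = if_else(avgLifeExp > 75, "Y", "N"))

continent_flags
```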

The results are shown in the following image:

Image 16 – Using if_else() upon attribute creation

And that’s all you should know about grouping! Let’s wrap things up next.

Conclusion

Today you’ve learned how to analyze data with R’s dplyr. It’s one of the most developer-friendly packages out there, way simpler than its Python competitor – Pandas.

You should be able to analyze and prepare any type of dataset after reading this article. You can do more advanced things, of course, but often these are just combinations of the things you’ve learned today.

FAQs

What are the steps to perform data analysis in R?

  1. Data Analysis with R.
  2. Getting Set Up. Accessing RStudio. Basic Scripting. ...
  3. 1 Data Visualization. 1.1 Scatter Plots. ...
  4. 2 Data Transformation. 2.1 Filtering a Data Set. ...
  5. 3 Importing and Cleaning Data. 3.1 Importing Tabular Data. ...
  6. 4 Linear Regression. 4.1 Simple Linear Regression. ...
  7. 5 Programming in R. 5.1 Print to Screen. ...
  8. Final Project.

Which are 5 of the most commonly used dplyr functions in R?

We're going to learn some of the most common dplyr functions: select(), filter(), mutate(), group_by(), and summarize(). To select columns of a data frame, use select().

Can you use R to analyze data?

R is a free, open source statistical programming language. It is useful for data cleaning, analysis, and visualization.

What are the functions contained in packages such as dplyr used for?

The dplyr package includes many functions for the most common data manipulation operations, such as filtering rows, selecting specific columns, sorting data, adding or deleting columns, and aggregating data.

What are the 7 steps of data analysis?

  • Step 1: Understanding the business problem. ...
  • Step 2: Analyze data requirements. ...
  • Step 3: Data understanding and collection. ...
  • Step 4: Data Preparation. ...
  • Step 5: Data visualization. ...
  • Step 6: Data analysis. ...
  • Step 7: Deployment.

What are the 5 key steps of the data analysis process?

It's a five-step framework to analyze data. The five steps are: 1) Identify business questions, 2) Collect and store data, 3) Clean and prepare data, 4) Analyze data, and 5) Visualize and communicate data.

Is Python or R better for data analysis?

Python is a general-purpose language, so it is more versatile and can be used for a wider range of tasks, such as web development, data manipulation, and machine learning. R, on the other hand, is primarily used for statistical analysis and data visualization.

Is R or Excel better for data analysis?

It is evident that the source code of R can be used repeatedly and with different data sets in ways that Excel formulas cannot. R clearly shows the code (instructions), data and columns used for an analysis in ways that Excel does not.

What is the disadvantage of using R as a data analytics tool?

R can be slower than languages such as Python or MATLAB for some workloads. It can also use a lot of memory, because R generally keeps the data it works on in physical memory.

What does %>% mean in dplyr?

The pipe operator (%>%) lets you read a chain of function calls left to right instead of inside out. It passes the output of the function on its left to the function on its right as the first argument. The following code calls select() and then arrange(): mtcars %>% select(cyl, mpg) %>% arrange(cyl, mpg)

What does dplyr stand for in R?

The d is for data.frame, and plyr is meant to evoke a set of pliers for manipulating things. dplyr is a data.frame-specific set of tools in the style of plyr.

What is the difference between dplyr and tidyverse?

dplyr is one of several packages that make up the tidyverse, a collection of R packages for data science that share a common design. Within it, dplyr handles data manipulation with a consistent and intuitive syntax, while tidyr, another tidyverse package, handles data tidying, such as converting wide data to long format or vice versa.

What are the steps required for data analysis?

The data analysis process involves several steps, including defining objectives and questions, data collection, data cleaning, data analysis, data interpretation and visualization, and data storytelling. Each step is crucial to ensuring the accuracy and usefulness of the results.

What are the 4 steps of data analysis?

All four levels create the puzzle of analytics: describe, diagnose, predict, prescribe. When all four work together, you can truly succeed with a data and analytical strategy. If the four aren't working well together or one part is completely missing, the organization's data and analytical strategy isn't complete.

What steps do you take to analyze data?

How to analyze data
  1. Establish a goal. First, determine the purpose and key objectives of your data analysis. ...
  2. Determine the type of data analytics to use. Identify the type of data that can answer your questions. ...
  3. Determine a plan to produce the data. ...
  4. Collect the data. ...
  5. Clean the data. ...
  6. Evaluate the data. ...
  7. Diagnostic analysis.

Article information

Author: Lidia Grady
