Data analysis using R - GeeksforGeeks (2024)

Last Updated : 09 Dec, 2022


Data analysis is a subset of data analytics. It is a process in which you make the objective clear, collect the relevant data, preprocess it, analyze it (understand the data and explore insights), and then visualize the results. The final step, visualization, matters because it helps people understand what is happening in the firm.

Steps involved in data analysis:

[Figure: steps involved in data analysis]

The process of data analysis includes all of these steps for a given problem statement. Example: analyze which products sell out rapidly and identify the frequent customers of a retail shop.

  • Defining the problem statement – Understand the goal and what needs to be done. In this case, the problem statement is: “Find the products that sell out most often and list the customers who visit the store frequently.”
  • Collection of data – Not all of the company’s data is necessary; identify the data relevant to the problem. Here the required columns are product ID, customer ID, and date visited.
  • Preprocessing – Clean the data and put it in a structured format before performing analysis:
  1. Remove outliers (noisy data).
  2. Remove null or irrelevant values in the columns, or replace null values with the mean of that column.
  3. If any data is missing, either ignore the tuple or fill the gap with the column mean.
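The retail example above can be sketched in a few lines of base R. The data frame `visits`, its column names (`product_id`, `customer_id`, `date_visited`), and all values are made up for illustration; a real shop would load its own records instead.

```r
# Hypothetical visit log: one row per purchase (illustrative values only)
visits <- data.frame(
  product_id   = c("P1", "P2", "P1", "P3", "P1", NA, "P2"),
  customer_id  = c("C1", "C2", "C1", "C3", "C1", "C1", NA),
  date_visited = as.Date(c("2022-01-03", "2022-01-03", "2022-01-04",
                           "2022-01-05", "2022-01-05", "2022-01-06",
                           "2022-01-07")),
  stringsAsFactors = FALSE
)

# Preprocessing: keep only rows with no missing values
clean_visits <- visits[complete.cases(visits), ]

# Analysis: count purchases per product and visits per customer
product_counts  <- sort(table(clean_visits$product_id), decreasing = TRUE)
customer_counts <- sort(table(clean_visits$customer_id), decreasing = TRUE)

top_product       <- names(product_counts)[1]
frequent_customer <- names(customer_counts)[1]
```

Sorting the frequency tables puts the most common product and customer first, which is usually enough for a first look before any plotting.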

Data Analysis using the Titanic dataset

Download the Titanic dataset (it contains data on real passengers of the Titanic) and save it in the current working directory. We can then start the analysis by getting to know the data.

R

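titanic <- read.csv("train.csv")  # Titanic file saved in the working directory

train <- titanic  # later snippets use the name train for the same data

head(titanic)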

Output:

  PassengerId Survived Pclass                                         Name    Sex
1         892        0      3                             Kelly, Mr. James   male
2         893        1      3             Wilkes, Mrs. James (Ellen Needs) female
3         894        0      2                    Myles, Mr. Thomas Francis   male
4         895        0      3                             Wirz, Mr. Albert   male
5         896        1      3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female
6         897        0      3                   Svensson, Mr. Johan Cervin   male
   Age SibSp Parch  Ticket    Fare Cabin Embarked
1 34.5     0     0  330911  7.8292              Q
2 47.0     1     0  363272  7.0000              S
3 62.0     0     0  240276  9.6875              Q
4 27.0     0     0  315154  8.6625              S
5 22.0     1     1 3101298 12.2875              S
6 14.0     0     0    7538  9.2250              S

The dataset contains columns such as the passenger's name, age, and gender, the class they traveled in, and whether they survived. To find the class (data type) of each column, use the sapply() function.

R

sapply(train, class)

Output:

PassengerId    Survived      Pclass        Name         Sex         Age
  "integer"   "integer"   "integer" "character" "character"   "numeric"
      SibSp       Parch      Ticket        Fare       Cabin    Embarked
  "integer"   "integer" "character"   "numeric" "character" "character"

The Survived column is coded 0 for “dead” and 1 for “alive”. We can convert it, along with Sex, to a categorical variable using the as.factor() function.

R

train$Survived=as.factor(train$Survived)

train$Sex=as.factor(train$Sex)

sapply(train, class)

Output:

PassengerId    Survived      Pclass        Name         Sex         Age
  "integer"    "factor"   "integer" "character"    "factor"   "numeric"
      SibSp       Parch      Ticket        Fare       Cabin    Embarked
  "integer"   "integer" "character"   "numeric" "character" "character"

We can begin the analysis with a summary of all the columns, their values, and their data types. summary() can be used for this purpose.

R

summary(train)

Output:

  PassengerId     Survived     Pclass          Name               Sex
 Min.   : 892.0   0:266    Min.   :1.000   Length:418         female:152
 1st Qu.: 996.2   1:152    1st Qu.:1.000   Class :character   male  :266
 Median :1100.5            Median :3.000   Mode  :character
 Mean   :1100.5            Mean   :2.266
 3rd Qu.:1204.8            3rd Qu.:3.000
 Max.   :1309.0            Max.   :3.000
      Age            SibSp            Parch           Ticket
 Min.   : 0.17   Min.   :0.0000   Min.   :0.0000   Length:418
 1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.0000   Class :character
 Median :27.00   Median :0.0000   Median :0.0000   Mode  :character
 Mean   :30.27   Mean   :0.4474   Mean   :0.3923
 3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.0000
 Max.   :76.00   Max.   :8.0000   Max.   :9.0000
 NA's   :86
      Fare            Cabin             Embarked
 Min.   :  0.000   Length:418         Length:418
 1st Qu.:  7.896   Class :character   Class :character
 Median : 14.454   Mode  :character   Mode  :character
 Mean   : 35.627
 3rd Qu.: 31.500
 Max.   :512.329
 NA's   :1

For the 891-row train.csv file, summary() yields the following observations (the sample output shown above comes from a 418-row file, so its figures differ):

  • Total passengers: 891
  • Number of people who survived: 342
  • Number of people who died: 549
  • Number of males on the Titanic: 577
  • Number of females on the Titanic: 314
  • Maximum age among all passengers: 80
  • Median age: 28

Preprocessing the data is important before analysis, so missing (NA) values have to be checked for and removed.

R

sum(is.na(train))

Output:

177
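A single total does not show which columns the missing values sit in. The sketch below, on a small made-up data frame (`passengers` and its values are assumptions, not the real file), counts NA values per column and shows mean imputation as one possible alternative to dropping rows.

```r
# Toy data frame standing in for the passenger data (values are illustrative)
passengers <- data.frame(
  Age  = c(22, NA, 35, NA, 54),
  Fare = c(7.25, 71.28, 8.05, 8.46, NA)
)

# Missing values per column, rather than one grand total
na_per_column <- colSums(is.na(passengers))

# Alternative to dropping rows: fill each gap with the column mean
imputed <- passengers
for (col in names(imputed)) {
  col_mean <- mean(imputed[[col]], na.rm = TRUE)
  imputed[[col]][is.na(imputed[[col]])] <- col_mean
}
```

Note that read.csv() keeps empty text cells as empty strings by default, so is.na() counts only cells that were actually read as NA.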

R

dropnull_train <- train[rowSums(is.na(train)) == 0, ]  # keep rows with no NA

  • dropnull_train contains only 714 rows: total rows in the dataset (891) – rows with a null value (177) = remaining rows (714).
  • Now we will split these 714 rows into separate data frames for passengers who survived and passengers who died.
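The row arithmetic above can be checked directly in R. This is a minimal sketch on a made-up data frame (`df` and its values are assumptions); it also shows complete.cases(), a base R function that selects the same rows as the rowSums() filter.

```r
# Small stand-in for the passenger table (illustrative values only)
df <- data.frame(
  PassengerId = 1:6,
  Age         = c(22, NA, 26, 35, NA, 54),
  Sex         = c("male", "female", "female", "male", "male", "female")
)

rows_total   <- nrow(df)
rows_with_na <- sum(!complete.cases(df))

# complete.cases() is TRUE for rows that contain no NA
df_complete <- df[complete.cases(df), ]

# Same result as filtering on rowSums(is.na(df)) == 0
same_as_rowsums <- identical(df_complete, df[rowSums(is.na(df)) == 0, ])
```

Comparing nrow(df_complete) with rows_total - rows_with_na is a quick sanity check that no rows were lost or kept by mistake.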

R

survivedlist=dropnull_train[dropnull_train$Survived == 1,]

notsurvivedlist=dropnull_train[dropnull_train$Survived == 0,]

Now we can visualize the data, such as the number of males and females who died or survived, using bar plots, histograms, and pie charts.

R

mytable <- table(titanic$Survived)

lbls <- paste(names(mytable), "\n", mytable, sep="")

pie(mytable,

labels = lbls,

main="Pie Chart of Survived column data\n (with sample sizes)")

Output:

[Figure: pie chart of the Survived column with sample sizes]

From the pie chart above, we can see that the target column, Survived, is imbalanced: one class has noticeably more rows than the other.
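The imbalance can be put into numbers with prop.table(). A minimal sketch, using the counts listed in the observations earlier (549 dead, 342 alive) rather than reading the file:

```r
# Counts taken from the observations listed earlier in the article
survived_counts <- c(dead = 549, alive = 342)

# Share of each class, as a proportion and as a rounded percentage
class_share   <- prop.table(survived_counts)
class_percent <- round(100 * class_share, 1)
```

On the real data frame, the same idea would be prop.table(table(titanic$Survived)).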

R

hist(survivedlist$Age,

xlab="age",

ylab="frequency")

Output:

[Figure: histogram of the ages of passengers who survived]

Now let’s draw a bar plot to visualize the number of males and females who did not survive.

R

barplot(table(notsurvivedlist$Sex),

xlab="gender",

ylab="frequency")

Output:

[Figure: bar plot of the number of males and females who did not survive]

From the bar plot above, we can see that nearly 350 males and 50 females did not survive the Titanic.
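Counts like these can also be read from a cross-tabulation instead of a plot. The sketch below uses a handful of made-up records (`toy` is an assumption, not the real passenger data) to show table() with two variables and row-wise proportions.

```r
# Illustrative records only; not the real passenger data
toy <- data.frame(
  Sex      = c("male", "male", "female", "male", "female", "female", "male"),
  Survived = c(0, 0, 1, 1, 1, 0, 0)
)

# Rows are Sex, columns are Survived (0 = dead, 1 = alive)
sex_by_survival <- table(toy$Sex, toy$Survived)

# Survival rate within each sex (each row sums to 1)
survival_rate <- prop.table(sex_by_survival, margin = 1)
```

Passing the two-way table to barplot() would draw the survived and not-survived groups in a single chart.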

R

temp<-density(titanic$Fare, na.rm=TRUE)

plot(temp, type="n",

main="Fare charged from Passengers")

polygon(temp, col="lightgray",

border="gray")

Output:

[Figure: density plot of fares charged from passengers]

Here we can observe that some passengers were charged extremely high fares. These values are outliers and can affect our analysis. Let’s confirm their presence using a boxplot.

R

boxplot(titanic$Fare,

main="Fare charged from passengers")

Output:

[Figure: boxplot of fares charged from passengers]

Certainly, there are some extreme outliers present in this dataset.
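The points a boxplot draws as outliers can also be listed programmatically. A minimal sketch, assuming a short made-up vector of fares (`fares` is not the real column): boxplot.stats() returns the flagged values, and a similar 1.5 × IQR rule can be written by hand with quantile(), whose quartiles may differ slightly from the boxplot hinges.

```r
# Illustrative fares; the last value is deliberately extreme
fares <- c(7.25, 7.90, 8.05, 13.00, 14.45, 26.00, 31.00, 512.33)

# boxplot.stats() returns the points a boxplot would draw as outliers
flagged <- boxplot.stats(fares)$out

# A similar rule by hand: beyond 1.5 * IQR from the quartiles
q      <- unname(quantile(fares, c(0.25, 0.75)))
fence  <- 1.5 * (q[2] - q[1])
manual <- fares[fares < q[1] - fence | fares > q[2] + fence]
```

Whether to drop, cap, or keep such values depends on the analysis; the sketch only identifies them.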




sreelekhakolla957



FAQs

How can R be used for data analysis?

One common use of R for business analytics is building custom data collection, clustering, and analytical models. Instead of opting for a pre-made approach, R data analysis allows companies to create statistics engines that can provide better, more relevant insights due to more precise data collection and storage.

Is Python or R better for data analysis?

If this is your first foray into computer programming, you may find Python code easier to learn and more broadly applicable. However, if you already have some understanding of programming languages or have specific career goals centered on data analysis, R language may be more tailored to your needs.

How long does it take to learn R for data analysis?

Brand new programmers may take six weeks to a few months to become comfortable with the R language. Three months is generally enough time for any new programmer to use the language and start applying it in their professional life.

How can you perform basic statistical analysis in R?

R can be used for these data management tasks.
  1. Calculating new variables. New variables can be calculated using the 'assign' operator. For example, creating a total score by summing 4 scores: ...
  2. Creating categorical variables. The 'ifelse( )' function can be used to create a two-category variable.
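The two tasks above can be sketched as follows. The data frame `scores`, its item names `q1` to `q4`, and the cutoff of 10 are all made up for illustration.

```r
# Hypothetical scores for three respondents
scores <- data.frame(
  q1 = c(3, 5, 2),
  q2 = c(4, 4, 1),
  q3 = c(5, 3, 2),
  q4 = c(2, 5, 1)
)

# New variable: total score from summing the four items
scores$total <- scores$q1 + scores$q2 + scores$q3 + scores$q4

# Two-category variable; the cutoff of 10 is arbitrary
scores$level <- ifelse(scores$total >= 10, "high", "low")
```

ifelse() works element by element, so each respondent gets a label without an explicit loop.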

Why is R a powerful tool for data analytics?

Because it was first designed by statisticians for statistical purposes, R is exceptionally well-suited to data science, an important field in today's world. While R's core function is statistical analysis and graphics, its use extends past these and into AI, machine learning, financial analysis, and more.

Is R more difficult than Python?

Overall, Python's easy-to-read syntax gives it a smoother learning curve. R tends to have a steeper learning curve at the beginning, but once you understand how to use its features, it gets significantly easier. Tip: Once you've learned one programming language, it's typically easier to learn another one.

Should I learn R or Python first?

Although R is designed to run basic data analysis easily and within minutes, things get harder with complex tasks, and it takes more time for R users to master the language. Overall, Python is considered a good language for beginner programmers.

Is R difficult to learn?

Learning R can be tough, especially for beginners. R's unique syntax and steep learning curve often surprise new learners. Its complex data structures and error messages can be overwhelming, particularly for those new to programming.

Can I learn R in 2 days?

For learners who wish to master R as quickly as possible, it will take several hours a day of structured learning to become comfortable with this language in just a week or two.

Can I learn R on my own?

One of the most effective ways to get started learning R is to start using it. RStudio.cloud Primers offer a cloud-based learning environment that will teach you the basics of R all from the comfort of your browser.

Do you need R to be a data analyst?

Most entry-level Data Analyst roles aren't going to use Python or R, so if that is what you are currently working towards, I wouldn't spend much time on Python and R until you've mastered the tools above. However, once you have some experience, learning Python and R will open more doors.

How do I start data analysis in R?

  1. Data Analysis with R.
  2. Getting Set Up. Accessing RStudio. Basic Scripting. ...
  3. 1 Data Visualization. 1.1 Scatter Plots. ...
  4. 2 Data Transformation. 2.1 Filtering a Data Set. ...
  5. 3 Importing and Cleaning Data. 3.1 Importing Tabular Data. ...
  6. 4 Linear Regression. 4.1 Simple Linear Regression. ...
  7. 5 Programming in R. 5.1 Print to Screen. ...
  8. Final Project.

Why use R for data analysis?

R is a free, open source statistical programming language. It is useful for data cleaning, analysis, and visualization. It complements workflows that require the use of other software. You can read more about the language and find documentation on the R Project Website.

What are the 5 basic methods of statistical analysis?

The 5 methods for performing statistical analysis
  • Mean.
  • Standard Deviation.
  • Regression.
  • Hypothesis Testing.
  • Sample Size Determination.
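Several of the methods listed above map onto single base R calls. The sketch below uses the built-in mtcars data set purely as a convenient example; the choice of variables is an assumption for illustration.

```r
# mtcars ships with base R (datasets package)
mpg_mean <- mean(mtcars$mpg)
mpg_sd   <- sd(mtcars$mpg)

# Simple linear regression: fuel economy as a function of weight
fit   <- lm(mpg ~ wt, data = mtcars)
slope <- coef(fit)[["wt"]]

# Two-sample t-test: automatic (am = 0) versus manual (am = 1) cars
test <- t.test(mpg ~ am, data = mtcars)
```

summary(fit) and printing test give the full regression table and test report.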

What are the benefits of R programming in data analysis?

Advantages of R Programming
  • Open-Source Language.
  • Compatibility And Versatility.
  • Extensions.
  • Data Visualization.
  • Machine Learning.
  • Statistics.
  • A Vast Array Of Packages.

What are reasons one would choose to use R for data analysis?

Data analysts choose R because it can process large amounts of data efficiently, create high-quality visualizations, and because it's a data-centric open-source programming language.

What is R useful for?

R is widely used in data science by statisticians and data miners for data analysis and the development of statistical software. R is one of the most comprehensive statistical programming languages available, capable of handling everything from data manipulation and visualization to statistical analysis.

What does R mean in data analysis?

The correlation coefficient is the specific measure that quantifies the strength of the linear relationship between two variables in a correlation analysis. The coefficient is what we symbolize with the r in a correlation report.
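In R, the correlation coefficient is computed with cor(). A minimal sketch on made-up, perfectly linear data, chosen so the result is easy to verify:

```r
# Perfectly linear toy data, so the coefficient is easy to check
x <- c(1, 2, 3, 4, 5)
y <- 2 * x + 1

r_positive <- cor(x, y)    # Pearson correlation by default
r_negative <- cor(x, -y)   # reversing the direction flips the sign
```

Real data rarely lies on a straight line, so values strictly between -1 and 1 are the norm.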

Article information

Author: Velia Krajcik
