R data analysis example: Learn data analysis fundamentals with Titanic survivor prediction

The sinking of the Titanic in 1912 is one of the most famous maritime disasters in history, and it lives on in many people's memories. But did you know that you can analyze who the passengers on board were and which factors decided whether they lived or died?

In this post, we'll use Titanic survivor data in an R data analysis example to predict which characteristics affected the probability of survival. We'll go from data loading through analysis, visualization, and predictive modeling, so stay tuned!

[Figure: R data analysis example, Titanic data visualization]

The importance of predicting Titanic survivors

The Titanic data is one of the most widely used examples for learning data analysis. The dataset records information about each passenger, including sex, age, passenger class, and fare, which makes it well suited to analyzing how these characteristics relate to survival. In this data analysis example, we'll build a simple logistic regression model to predict survival from this information.

Along the way, you'll get hands-on experience with the full workflow of data analysis, from data preprocessing and exploratory data analysis (EDA) to visualization and predictive modeling.

Loading and exploring data

Let's start by loading and exploring the Titanic data in R. This is a representative training dataset provided by Kaggle that records various characteristics of the passengers.

Loading data in R

# Install and load the package that contains the dataset
install.packages("titanic") # Skip this line if the package is already installed.
library(titanic)

# Load the Titanic data
data("titanic_train")
titanic <- titanic_train

# Explore the data
str(titanic)
summary(titanic)
head(titanic)

The code above loads the titanic_train data into R and shows its structure and summary statistics. The data includes variables such as Survived, Pclass, Sex, Age, Fare, and Embarked.

[Figure: str(titanic) output]
[Figure: summary(titanic) output]
[Figure: head(titanic) output]

Data structure descriptions

  • Survived: survival status (0 = died, 1 = survived)
  • Pclass: passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
  • Sex: sex (male, female)
  • Age: age in years
  • Fare: ticket fare
  • Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

With this data, you can analyze how the characteristics of each passenger affected their survival.
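Before preprocessing, it can help to count the missing values in each column and look at raw survival rates. The following is a minimal sketch that uses a small made-up data frame (`toy`, which is hypothetical and not the real dataset) with the same column names, so it runs without installing anything. Replacing `toy` with `titanic` should work the same way, assuming the columns are as described above.

```r
# Hypothetical toy data with the same column names as titanic_train
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 1, 0),
  Pclass   = c(3, 1, 2, 3, 1, 3),
  Sex      = c("male", "female", "female", "male", "female", "male"),
  Age      = c(22, 38, NA, 35, NA, 54),
  Fare     = c(7.25, 71.28, 13.00, 8.05, 53.10, 51.86)
)

# Number of NA values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Overall survival rate, and survival rate by sex
# (the mean of a 0/1 variable is the share of 1s)
overall_rate <- mean(toy$Survived)
rate_by_sex  <- tapply(toy$Survived, toy$Sex, mean)
print(overall_rate)
print(rate_by_sex)
```

Note that is.na() only counts values stored as NA. If a text column stores missing entries as empty strings, they need a separate check such as sum(toy$Sex == "").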

Data preprocessing and visualization

The Titanic data contains some missing values. The Age column in particular has many, so we need to deal with them. After some basic preprocessing, let's visualize the survival rate by sex and passenger class.

Data preprocessing

# Handle missing values (replace missing ages with the median)
titanic$Age[is.na(titanic$Age)] <- median(titanic$Age, na.rm = TRUE)

# Check the result
summary(titanic$Age)

# The result of executing the above code is #
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.42   22.00   28.00   29.36   35.00   80.00

The code above replaces the missing values with the median. Checking the Age column after preprocessing, we see that the mean age is 29.36, the median is 28, and ages range from 0.42 to 80. Now let's visualize survival by sex and passenger class.
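Median imputation is simple, but it changes the shape of the data: the median stays the same while the spread shrinks. The sketch below illustrates this on a made-up vector (`age` is hypothetical and not taken from the real dataset) and also keeps a record of which values were imputed, which can be useful later in the analysis.

```r
# Hypothetical ages with missing values (not taken from the real dataset)
age <- c(22, 38, NA, 35, NA, 54, 2, 27)

# Remember which entries were missing before imputing
age_missing <- is.na(age)

# Replace missing values with the median of the observed values
age_median  <- median(age, na.rm = TRUE)
age_imputed <- ifelse(age_missing, age_median, age)

# Compare the spread before and after imputation
sd_before <- sd(age, na.rm = TRUE)
sd_after  <- sd(age_imputed)
print(c(median = age_median, sd_before = sd_before, sd_after = sd_after))
```

The standard deviation is smaller after imputation because every filled-in value sits exactly at the center of the distribution.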

Visualize survival in R

# Install and load ggplot2
install.packages("ggplot2") # Skip this line if ggplot2 is already installed.
library(ggplot2)

# Visualize survival rates by sex and passenger class
ggplot(titanic, aes(x = factor(Pclass), fill = factor(Survived))) +
  geom_bar(position = "fill") +
  facet_wrap(~Sex) +
  labs(title = "Survival by sex and passenger class", x = "Passenger class", y = "Proportion", fill = "Survived")

This chart gives an intuitive view of survival rates by sex and passenger class. For example, female passengers in first class have a very high survival rate, which suggests that sex and class were among the biggest factors in whether a passenger survived. For men, the chart shows that class has less of an impact on survival.

[Figure: R data analysis example, Titanic data visualization 1]
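The proportions shown in a stacked bar chart like this one can also be computed as numbers, which makes the groups easier to compare. This is a minimal sketch on a made-up data frame (`toy` is hypothetical); with the real data, the same aggregate() call could be applied to `titanic`.

```r
# Hypothetical toy data with the same column names as the Titanic data
toy <- data.frame(
  Survived = c(1, 1, 0, 1, 0, 0, 0, 1),
  Pclass   = c(1, 1, 3, 3, 1, 3, 3, 1),
  Sex      = c("female", "female", "female", "female",
               "male", "male", "male", "male")
)

# The mean of a 0/1 variable within a group is that group's survival rate
rates <- aggregate(Survived ~ Sex + Pclass, data = toy, FUN = mean)
print(rates)

# Look up single cells of the result
female_first <- rates$Survived[rates$Sex == "female" & rates$Pclass == 1]
male_third   <- rates$Survived[rates$Sex == "male" & rates$Pclass == 3]
```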

Survival prediction models using logistic regression

Now let's build a predictive model using logistic regression. Logistic regression is a technique often used for binary classification problems, and we'll use it to build a simple model that predicts whether a passenger survived.
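Logistic regression models the log-odds of the outcome as a linear combination of the predictors and then converts the log-odds into a probability with the logistic function, p = 1 / (1 + exp(-x)). The sketch below shows that conversion on a few arbitrary example values (`log_odds` is made up for illustration). plogis() and qlogis() are the base R versions of the function and its inverse.

```r
# The logistic (sigmoid) function turns any real number into a probability
logistic <- function(x) 1 / (1 + exp(-x))

# Arbitrary example log-odds, from strongly negative to strongly positive
log_odds <- c(-4, -1, 0, 1, 4)
probs <- logistic(log_odds)
print(round(probs, 3))

# Base R provides the same function as plogis(); qlogis() is its inverse
same_as_plogis   <- all.equal(probs, plogis(log_odds))
back_to_log_odds <- qlogis(probs)
```

A log-odds value of 0 corresponds to a probability of 0.5, which is why 0.5 is a natural default cutoff when turning probabilities into class predictions.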

Build a logistic regression model in R

# Build a logistic regression model
model <- glm(Survived ~ Pclass + Sex + Age + Fare, data = titanic, family = binomial)
summary(model)

# Code execution result #
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  4.6553374  0.5085945   9.153  < 2e-16 ***
Pclass      -1.1529180  0.1355637  -8.505  < 2e-16 ***
Sexmale     -2.6072959  0.1872514 -13.924  < 2e-16 ***
Age         -0.0331244  0.0073991  -4.477 7.58e-06 ***
Fare         0.0005922  0.0020347   0.291    0.771
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1186.7 on 890 degrees of freedom
Residual deviance:  805.5 on 886 degrees of freedom
AIC: 815.5

Number of Fisher Scoring iterations: 5

This code creates a logistic regression model that predicts survival with passenger class, sex, age, and fare as independent variables. The summary() function shows the model's coefficients, which tell you which variables have a significant association with survival.

Explanation of statistics

(Model Description)
glm(formula = Survived ~ Pclass + Sex + Age + Fare, family = binomial, data = titanic)
This sets Survived as the dependent variable and Pclass, Sex, Age, and Fare as the independent variables. family = binomial means that the model handles a binary classification problem, where the outcome is either 0 (died) or 1 (survived).
(Coefficient descriptions)
- Intercept (y-intercept)
Estimate: 4.655
Meaning: This value is the log-odds of survival when all independent variables are zero, that is, for a female passenger (the reference level of Sex) with Pclass, Age, and Fare all equal to 0. No such passenger exists, so the intercept serves mainly as a baseline rather than something to interpret on its own.
- Pclass (Passenger class)
Estimate: -1.153
Meaning: The coefficient is negative, so the log-odds of survival drop by about 1.153 for each step from 1st class toward 3rd class. Passengers in a higher class (1st class) were more likely to survive than passengers in a lower class (3rd class).
- Sex (Sex: male)
Estimate: -2.607
Meaning: Male passengers were less likely to survive than female passengers, which is consistent with the "women first" rescue principle during the Titanic disaster.
- Age
Estimate: -0.033
Meaning: The coefficient is negative, so the probability of survival decreases slightly with each additional year of age.
- Fare
Estimate: 0.000592
Meaning: Higher fares are associated with a slightly higher probability of survival, but the effect is very small and not statistically significant. The p-value (Pr(>|z|)) is 0.771, so once class, sex, and age are accounted for, fare adds little.
- p-value (Pr(>|z|))
The p-value shows the statistical significance of each variable. By convention, a value below 0.05 is treated as statistically significant. In our results, three variables are highly significant, with the following p-values:
Pclass: < 2e-16 (highly significant) | Sex: < 2e-16 (highly significant) | Age: 7.58e-06 (highly significant)
Fare, on the other hand, has a p-value of 0.771, so its effect on survival is not statistically significant.
- Null and residual deviance
Null deviance: 1186.7
The deviance of a model with only an intercept, that is, one that predicts the same survival probability for every passenger.
Residual deviance: 805.5
The deviance that remains after fitting the model. Comparing the two values shows how much the predictors improve the fit. Here the deviance falls by 381.2 with 4 additional parameters, a substantial improvement over the intercept-only model.
- AIC (Akaike information criterion): 815.5
AIC measures goodness of fit while penalizing model complexity, with lower values indicating a better model. It is most useful for comparing different models fitted to the same data.
(Synthesis)
Passenger class, sex, and age all had a highly significant effect on the probability of survival, with passenger class and sex being the most influential factors. Fare did not have a significant effect on survival. Overall, the model predicts Titanic survival reasonably well.
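Log-odds coefficients are easier to read once they are converted to odds ratios and probabilities. The sketch below works directly from the numbers reported in the summary output above, so it does not need the fitted model. The passenger profile (first class, age 30, fare 50) and the helper `predict_prob()` are hypothetical and chosen only for illustration. The last step is a likelihood-ratio test based on the two reported deviances.

```r
# Coefficients copied from the summary output above
coefs <- c("(Intercept)" = 4.6553374, Pclass = -1.1529180,
           Sexmale = -2.6072959, Age = -0.0331244, Fare = 0.0005922)

# Odds ratios: multiplicative change in the odds of survival
# for a one-unit increase in each predictor
odds_ratios <- exp(coefs[-1])
print(round(odds_ratios, 3))

# Hypothetical helper: predicted survival probability for one passenger
predict_prob <- function(pclass, male, age, fare) {
  log_odds <- coefs[["(Intercept)"]] + coefs[["Pclass"]] * pclass +
    coefs[["Sexmale"]] * male + coefs[["Age"]] * age + coefs[["Fare"]] * fare
  plogis(log_odds)
}

# Same hypothetical profile, differing only in sex
p_female <- predict_prob(pclass = 1, male = 0, age = 30, fare = 50)
p_male   <- predict_prob(pclass = 1, male = 1, age = 30, fare = 50)
print(round(c(female = p_female, male = p_male), 3))

# Likelihood-ratio test of the model against the intercept-only model
dev_drop   <- 1186.7 - 805.5   # null deviance minus residual deviance
df_drop    <- 890 - 886        # number of parameters added
lr_p_value <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)
print(lr_p_value)
```

An odds ratio below 1 means the odds of survival go down as the predictor increases. With the fitted model object available, exp(coef(model)) would give the same odds ratios.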

Evaluate prediction results

After building the model, we need to evaluate it. To evaluate how accurately the model predicts survival in practice, we generate predicted values and evaluate them through a confusion matrix.

# Generate predicted probabilities
pred <- predict(model, type = "response")

# Classify probabilities greater than 0.5 as survival
pred_class <- ifelse(pred > 0.5, 1, 0)

# Generate a confusion matrix against the true values
table(pred_class, titanic$Survived)

# Code execution result #
pred_class   0   1
         0 469  98
         1  80 244

This step compares the predicted values to the actual values, classifying a passenger as survived (1) if the predicted probability is greater than 0.5 and as died (0) otherwise. The confusion matrix lets us evaluate the accuracy and predictive performance of the model. The model is fairly accurate (80.0%), correctly predicting the survival status of most passengers. Its precision is 75.3%, meaning that when it predicts that someone survived, it is usually right. Its recall is lower (71.3%), so it misses some survivors and predicts them as non-survivors. The false positive rate is 14.6%: that share of the passengers who died is incorrectly predicted to have survived. Because these figures are computed on the same data that was used to fit the model, they are likely to be somewhat optimistic.
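The metrics quoted here can be reproduced from the four cells of the confusion matrix. The sketch below rebuilds the matrix from the counts printed above (rows are predictions, columns are actual values) and computes each metric by hand; the variable names are only illustrative.

```r
# Confusion matrix copied from the output above
# (rows = predicted class, columns = actual class; filled column by column)
cm <- matrix(c(469, 80, 98, 244), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

tn <- cm["0", "0"]  # predicted died, actually died
fp <- cm["1", "0"]  # predicted survived, actually died
fn <- cm["0", "1"]  # predicted died, actually survived
tp <- cm["1", "1"]  # predicted survived, actually survived

accuracy  <- (tp + tn) / sum(cm)  # share of all passengers classified correctly
precision <- tp / (tp + fp)       # share of predicted survivors who survived
recall    <- tp / (tp + fn)       # share of actual survivors the model found
fpr       <- fp / (fp + tn)       # share of non-survivors predicted to survive

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, fpr = fpr), 3))
```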

Common mistakes in data analysis and how to fix them

[Figure: R data analysis example, common mistakes]

There are many mistakes that can be made when analyzing data. In this data analysis example, we'll look at the main mistakes you might make and how to fix them.

  1. Mistakes in handling missing values: Simply removing missing values or handling them incorrectly can skew the results of your analysis. Consider replacing them with the median or mean, or using model-based imputation.
  2. Lack of data scaling: In logistic regression, large differences in scale between variables make coefficients hard to compare and can cause problems when the model is regularized or fitted by gradient descent. Variables such as Fare may need scaling.
  3. Overfitting issues: If a model fits the training data too closely, it may perform poorly on new data. This is why it's important to evaluate model performance through cross-validation.
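As an illustration of the cross-validation mentioned in the last point, here is a minimal k-fold sketch in base R. It runs on simulated stand-in data (`sim`, with made-up predictors `x1` and `x2`) rather than the Titanic dataset, so the numbers it prints say nothing about the model in this post. The same loop could be adapted by swapping in `titanic` and the model formula used earlier.

```r
set.seed(42)  # make the simulated data and fold assignment reproducible

# Simulated stand-in data (hypothetical, not the Titanic dataset)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(1.5 * sim$x1 - 1 * sim$x2))

# Assign each row to one of k folds at random
k <- 5
fold <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other folds, then score on the held-out fold
fold_accuracy <- sapply(1:k, function(i) {
  train <- sim[fold != i, ]
  test  <- sim[fold == i, ]
  fit   <- glm(y ~ x1 + x2, data = train, family = binomial)
  prob  <- predict(fit, newdata = test, type = "response")
  mean(ifelse(prob > 0.5, 1, 0) == test$y)
})

# Average held-out accuracy across the folds
cv_accuracy <- mean(fold_accuracy)
print(round(fold_accuracy, 3))
print(round(cv_accuracy, 3))
```

Because every row is scored by a model that never saw it during fitting, the averaged accuracy is a fairer estimate of performance on new data than accuracy measured on the training set.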

FAQs

Q1: Where can I download the Titanic data?
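A: The Titanic dataset is available for download on Kaggle, which also hosts a variety of other datasets that are useful for modeling exercises. In R, the same training data is available through the titanic package used in this post.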

Q2: How do logistic regression models work?
A: Logistic regression is used to predict a binary outcome rather than a continuous one. It outputs a probability, which can be used to predict how likely a particular event is to occur.

Q3: How should I handle missing values in data preprocessing?
A: How you handle missing values depends on your situation. You can substitute a median or mean value, or use modeling techniques that can handle missing values.

Summary

In this post, we used Titanic survivor data as a data analytics example to build a survival prediction model. We preprocessed the data, gained insights from various visualizations, and performed predictive analysis using a logistic regression model. The Titanic data is a great learning resource that can be used to learn the basics of data analysis and practical applications.

You can use these examples to build your data analysis skills and challenge yourself with different datasets!
