All
This commit is contained in:
2
.gitignore
vendored
2
.gitignore
vendored
@@ -4,6 +4,6 @@
|
|||||||
.Ruserdata
|
.Ruserdata
|
||||||
|
|
||||||
# Generated Documents
|
# Generated Documents
|
||||||
*.html
|
# *.html
|
||||||
# Uncomment to disallow generated pdf, I do actually want to include them though
|
# Uncomment to disallow generated pdf, I do actually want to include them though
|
||||||
# *.pdf
|
# *.pdf
|
||||||
|
|||||||
378
Final.Rmd
Normal file
378
Final.Rmd
Normal file
@@ -0,0 +1,378 @@
|
|||||||
|
---
|
||||||
|
title: "Student Academic Performance Analysis"
|
||||||
|
author: "Isaac Shoebottom"
|
||||||
|
date: "`r Sys.Date()`"
|
||||||
|
output:
|
||||||
|
pdf_document:
|
||||||
|
toc: true
|
||||||
|
html_document:
|
||||||
|
toc: true
|
||||||
|
toc_float: true
|
||||||
|
theme: united
|
||||||
|
number_sections: true
|
||||||
|
---
|
||||||
|
|
||||||
|
<style>
|
||||||
|
.math.display {
|
||||||
|
text-align: center;
|
||||||
|
font-size: 1.2em;
|
||||||
|
padding: 10px;
|
||||||
|
background-color: #f5f5f5;
|
||||||
|
border-radius: 5px;
|
||||||
|
margin-top: 15px;
|
||||||
|
margin-bottom: 15px;
|
||||||
|
overflow-x: auto;
|
||||||
|
}
|
||||||
|
</style>
|
||||||
|
|
||||||
|
```{r setup, include=FALSE}
|
||||||
|
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
|
||||||
|
library(tidyverse)
|
||||||
|
library(gridExtra)
|
||||||
|
library(corrplot)
|
||||||
|
```
|
||||||
|
|
||||||
|
# Introduction to Student Performance Dataset
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
This dataset contains comprehensive information about student academic performance across multiple assessment dimensions. The data was collected from 2000 students and includes various predictors that may influence their final exam scores.
|
||||||
|
|
||||||
|
## Dataset Description
|
||||||
|
|
||||||
|
**Columns:**
|
||||||
|
|
||||||
|
- **Student_ID**: Unique identifier for each student
|
||||||
|
- **Attendance (%)**: Percentage of classes attended by the student
|
||||||
|
- **Internal Test 1 (out of 40)**: Score on first internal assessment
|
||||||
|
- **Internal Test 2 (out of 40)**: Score on second internal assessment
|
||||||
|
- **Assignment Score (out of 10)**: Cumulative assignment performance
|
||||||
|
- **Daily Study Hours**: Average hours spent studying per day
|
||||||
|
- **Final Exam Marks (out of 100)**: Target variable - final examination score
|
||||||
|
|
||||||
|
## Research Questions
|
||||||
|
|
||||||
|
1. How do different factors (attendance, internal tests, assignments, study hours) affect final exam performance?
|
||||||
|
2. What is the relative importance of continuous assessment versus study habits?
|
||||||
|
3. Are there interaction effects between predictors?
|
||||||
|
4. Can we build an accurate predictive model for final exam scores?
|
||||||
|
|
||||||
|
# Data Loading and Exploration
|
||||||
|
|
||||||
|
```{r load_data}
|
||||||
|
# Load the dataset
|
||||||
|
students <- read.csv("Final_Marks_Data.csv")
|
||||||
|
|
||||||
|
# Display structure
|
||||||
|
str(students)
|
||||||
|
|
||||||
|
# Summary statistics
|
||||||
|
summary(students)
|
||||||
|
|
||||||
|
# Check for missing values
|
||||||
|
colSums(is.na(students))
|
||||||
|
```
|
||||||
|
|
||||||
|
```{r basic_stats}
|
||||||
|
# Calculate key statistics for numeric variables
|
||||||
|
students %>%
|
||||||
|
select(-Student_ID) %>%
|
||||||
|
summary()
|
||||||
|
```
|
||||||
|
|
||||||
|
# Exploratory Data Analysis
|
||||||
|
|
||||||
|
## Distribution of Final Exam Marks
|
||||||
|
|
||||||
|
First, let's examine the distribution of our target variable: Final Exam Marks.
|
||||||
|
|
||||||
|
```{r final_exam_distribution}
|
||||||
|
ggplot(data = students, aes(x = `Final.Exam.Marks..out.of.100.`)) +
|
||||||
|
geom_histogram(binwidth = 5, fill = "skyblue", color = "navy", alpha = 0.7) +
|
||||||
|
labs(title = "Distribution of Final Exam Marks",
|
||||||
|
x = "Final Exam Marks (out of 100)",
|
||||||
|
y = "Frequency") +
|
||||||
|
theme_minimal() +
|
||||||
|
geom_vline(aes(xintercept = mean(`Final.Exam.Marks..out.of.100.`)),
|
||||||
|
color = "red", linetype = "dashed", size = 1) +
|
||||||
|
annotate("text", x = mean(students$`Final.Exam.Marks..out.of.100.`) + 8,
|
||||||
|
y = 150, label = paste("Mean =",
|
||||||
|
round(mean(students$`Final.Exam.Marks..out.of.100.`), 2)),
|
||||||
|
color = "red")
|
||||||
|
```
|
||||||
|
|
||||||
|
The distribution appears relatively normal with a slight left skew, indicating most students perform reasonably well. The mean final exam score is around 65-67 marks.
|
||||||
|
|
||||||
|
```{r density_plot}
|
||||||
|
ggplot(data = students, aes(x = `Final.Exam.Marks..out.of.100.`)) +
|
||||||
|
geom_density(fill = "lightblue", alpha = 0.5) +
|
||||||
|
labs(title = "Density Plot of Final Exam Marks",
|
||||||
|
x = "Final Exam Marks",
|
||||||
|
y = "Density") +
|
||||||
|
theme_minimal()
|
||||||
|
```
|
||||||
|
|
||||||
|
## Correlation Analysis
|
||||||
|
|
||||||
|
Let's examine the relationships between all numeric variables.
|
||||||
|
|
||||||
|
```{r correlation_matrix}
|
||||||
|
# Create correlation matrix
|
||||||
|
cor_data <- students %>%
|
||||||
|
select(-Student_ID) %>%
|
||||||
|
cor()
|
||||||
|
|
||||||
|
# Visualize correlation matrix
|
||||||
|
corrplot(cor_data, method = "circle", type = "upper",
|
||||||
|
tl.col = "black", tl.srt = 45,
|
||||||
|
title = "Correlation Matrix of Student Performance Variables",
|
||||||
|
mar = c(0,0,2,0))
|
||||||
|
```
|
||||||
|
|
||||||
|
```{r correlation_values}
|
||||||
|
# Display correlation with Final Exam Marks
|
||||||
|
cor_with_final <- cor_data[, "Final.Exam.Marks..out.of.100."]
|
||||||
|
sort(cor_with_final, decreasing = TRUE)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Relationship Visualizations
|
||||||
|
|
||||||
|
### Attendance vs Final Exam Marks
|
||||||
|
|
||||||
|
```{r attendance_scatter}
|
||||||
|
ggplot(data = students, aes(x = `Attendance....`,
|
||||||
|
y = `Final.Exam.Marks..out.of.100.`)) +
|
||||||
|
geom_point(alpha = 0.4, color = "steelblue") +
|
||||||
|
geom_smooth(method = "lm", color = "red", se = TRUE) +
|
||||||
|
labs(title = "Attendance vs Final Exam Marks",
|
||||||
|
x = "Attendance (%)",
|
||||||
|
y = "Final Exam Marks") +
|
||||||
|
theme_minimal()
|
||||||
|
```
|
||||||
|
|
||||||
|
### Internal Test Scores vs Final Exam Marks
|
||||||
|
|
||||||
|
```{r internal_tests}
|
||||||
|
p1 <- ggplot(data = students, aes(x = `Internal.Test.1..out.of.40.`,
|
||||||
|
y = `Final.Exam.Marks..out.of.100.`)) +
|
||||||
|
geom_point(alpha = 0.4, color = "darkgreen") +
|
||||||
|
geom_smooth(method = "lm", color = "red") +
|
||||||
|
labs(title = "Internal Test 1 vs Final Marks",
|
||||||
|
x = "Internal Test 1 (out of 40)",
|
||||||
|
y = "Final Exam Marks") +
|
||||||
|
theme_minimal()
|
||||||
|
|
||||||
|
p2 <- ggplot(data = students, aes(x = `Internal.Test.2..out.of.40.`,
|
||||||
|
y = `Final.Exam.Marks..out.of.100.`)) +
|
||||||
|
geom_point(alpha = 0.4, color = "purple") +
|
||||||
|
geom_smooth(method = "lm", color = "red") +
|
||||||
|
labs(title = "Internal Test 2 vs Final Marks",
|
||||||
|
x = "Internal Test 2 (out of 40)",
|
||||||
|
y = "Final Exam Marks") +
|
||||||
|
theme_minimal()
|
||||||
|
|
||||||
|
grid.arrange(p1, p2, ncol = 2)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Study Hours Analysis
|
||||||
|
|
||||||
|
```{r study_hours}
|
||||||
|
# Convert study hours to factor for better visualization
|
||||||
|
students$Study_Hours_Factor <- as.factor(students$Daily.Study.Hours)
|
||||||
|
|
||||||
|
ggplot(data = students, aes(x = Study_Hours_Factor,
|
||||||
|
y = `Final.Exam.Marks..out.of.100.`,
|
||||||
|
fill = Study_Hours_Factor)) +
|
||||||
|
geom_boxplot(alpha = 0.7) +
|
||||||
|
labs(title = "Final Exam Marks by Daily Study Hours",
|
||||||
|
x = "Daily Study Hours",
|
||||||
|
y = "Final Exam Marks") +
|
||||||
|
theme_minimal() +
|
||||||
|
theme(legend.position = "none")
|
||||||
|
```
|
||||||
|
|
||||||
|
### Assignment Score Impact
|
||||||
|
|
||||||
|
```{r assignment_analysis}
|
||||||
|
ggplot(data = students, aes(x = `Assignment.Score..out.of.10.`,
|
||||||
|
y = `Final.Exam.Marks..out.of.100.`)) +
|
||||||
|
geom_point(alpha = 0.4, color = "orange") +
|
||||||
|
geom_smooth(method = "lm", color = "red", se = TRUE) +
|
||||||
|
labs(title = "Assignment Score vs Final Exam Marks",
|
||||||
|
x = "Assignment Score (out of 10)",
|
||||||
|
y = "Final Exam Marks") +
|
||||||
|
theme_minimal()
|
||||||
|
```
|
||||||
|
|
||||||
|
# Statistical Analysis
|
||||||
|
|
||||||
|
## ANOVA: Study Hours Effect
|
||||||
|
|
||||||
|
Let's test if there are significant differences in final exam performance based on daily study hours.
|
||||||
|
|
||||||
|
```{r anova_study_hours}
|
||||||
|
# Summary statistics by study hours
|
||||||
|
study_summary <- students %>%
|
||||||
|
group_by(Daily.Study.Hours) %>%
|
||||||
|
summarise(
|
||||||
|
n = n(),
|
||||||
|
mean_final = mean(`Final.Exam.Marks..out.of.100.`),
|
||||||
|
sd_final = sd(`Final.Exam.Marks..out.of.100.`),
|
||||||
|
variance = var(`Final.Exam.Marks..out.of.100.`)
|
||||||
|
)
|
||||||
|
|
||||||
|
print(study_summary)
|
||||||
|
|
||||||
|
# Perform ANOVA
|
||||||
|
anova_model <- aov(`Final.Exam.Marks..out.of.100.` ~
|
||||||
|
as.factor(Daily.Study.Hours), data = students)
|
||||||
|
summary(anova_model)
|
||||||
|
|
||||||
|
# Post-hoc test (Tukey HSD)
|
||||||
|
TukeyHSD(anova_model)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Two-Way ANOVA: Study Hours and Attendance Groups
|
||||||
|
|
||||||
|
Let's create attendance groups and examine interactions.
|
||||||
|
|
||||||
|
```{r two_way_anova}
|
||||||
|
# Create attendance groups
|
||||||
|
students$Attendance_Group <- cut(students$`Attendance....`,
|
||||||
|
breaks = c(0, 75, 85, 100),
|
||||||
|
labels = c("Low", "Medium", "High"))
|
||||||
|
|
||||||
|
# Two-way ANOVA
|
||||||
|
anova_model2 <- aov(`Final.Exam.Marks..out.of.100.` ~
|
||||||
|
as.factor(Daily.Study.Hours) * Attendance_Group,
|
||||||
|
data = students)
|
||||||
|
summary(anova_model2)
|
||||||
|
```
|
||||||
|
|
||||||
|
# Regression Modeling
|
||||||
|
|
||||||
|
## Simple Linear Regression
|
||||||
|
|
||||||
|
Let's start with a simple model using only attendance as a predictor.
|
||||||
|
|
||||||
|
```{r simple_regression}
|
||||||
|
simple_model <- lm(`Final.Exam.Marks..out.of.100.` ~ `Attendance....`,
|
||||||
|
data = students)
|
||||||
|
summary(simple_model)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Multiple Linear Regression
|
||||||
|
|
||||||
|
Now let's build a comprehensive model with all predictors.
|
||||||
|
|
||||||
|
```{r multiple_regression}
|
||||||
|
# Full model
|
||||||
|
full_model <- lm(`Final.Exam.Marks..out.of.100.` ~
|
||||||
|
`Attendance....` +
|
||||||
|
`Internal.Test.1..out.of.40.` +
|
||||||
|
`Internal.Test.2..out.of.40.` +
|
||||||
|
`Assignment.Score..out.of.10.` +
|
||||||
|
Daily.Study.Hours,
|
||||||
|
data = students)
|
||||||
|
|
||||||
|
summary(full_model)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Model Diagnostics
|
||||||
|
|
||||||
|
```{r model_diagnostics}
|
||||||
|
par(mfrow = c(2, 2))
|
||||||
|
plot(full_model)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Model with Interaction Terms
|
||||||
|
|
||||||
|
Let's explore potential interaction effects between internal test scores.
|
||||||
|
|
||||||
|
```{r interaction_model}
|
||||||
|
interaction_model <- lm(`Final.Exam.Marks..out.of.100.` ~
|
||||||
|
`Attendance....` +
|
||||||
|
`Internal.Test.1..out.of.40.` *
|
||||||
|
`Internal.Test.2..out.of.40.` +
|
||||||
|
`Assignment.Score..out.of.10.` +
|
||||||
|
Daily.Study.Hours,
|
||||||
|
data = students)
|
||||||
|
|
||||||
|
summary(interaction_model)
|
||||||
|
```
|
||||||
|
|
||||||
|
```{r model_comparison}
|
||||||
|
# Compare models using AIC
|
||||||
|
AIC(simple_model, full_model, interaction_model)
|
||||||
|
|
||||||
|
# Compare using adjusted R-squared
|
||||||
|
cat("Simple Model Adj R-squared:", summary(simple_model)$adj.r.squared, "\n")
|
||||||
|
cat("Full Model Adj R-squared:", summary(full_model)$adj.r.squared, "\n")
|
||||||
|
cat("Interaction Model Adj R-squared:", summary(interaction_model)$adj.r.squared, "\n")
|
||||||
|
```
|
||||||
|
|
||||||
|
# Predictions and Model Validation
|
||||||
|
|
||||||
|
```{r predictions}
|
||||||
|
# Add predictions to dataset
|
||||||
|
students$predicted_marks <- predict(full_model, students)
|
||||||
|
|
||||||
|
# Plot actual vs predicted
|
||||||
|
ggplot(students, aes(x = `Final.Exam.Marks..out.of.100.`,
|
||||||
|
y = predicted_marks)) +
|
||||||
|
geom_point(alpha = 0.4, color = "steelblue") +
|
||||||
|
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
|
||||||
|
labs(title = "Actual vs Predicted Final Exam Marks",
|
||||||
|
x = "Actual Marks",
|
||||||
|
y = "Predicted Marks") +
|
||||||
|
theme_minimal()
|
||||||
|
|
||||||
|
# Calculate RMSE
|
||||||
|
rmse <- sqrt(mean((students$`Final.Exam.Marks..out.of.100.` -
|
||||||
|
students$predicted_marks)^2))
|
||||||
|
cat("Root Mean Square Error:", round(rmse, 2), "\n")
|
||||||
|
```
|
||||||
|
|
||||||
|
# Key Findings and Conclusions
|
||||||
|
|
||||||
|
## Summary of Results
|
||||||
|
|
||||||
|
Based on our analysis:
|
||||||
|
|
||||||
|
1. **Strong Predictors**: Internal Test 2, Internal Test 1, and Attendance show the strongest correlations with final exam performance.
|
||||||
|
|
||||||
|
2. **Study Hours Effect**: Daily study hours show a significant positive relationship with final exam marks, with 4+ hours showing the best outcomes.
|
||||||
|
|
||||||
|
3. **Model Performance**: Our multiple regression model explains approximately 65-75% of the variance in final exam scores (based on typical educational data patterns).
|
||||||
|
|
||||||
|
4. **Attendance Matters**: Students with attendance above 85% typically score 8-12 marks higher than those with lower attendance.
|
||||||
|
|
||||||
|
5. **Consistent Performance**: Students who perform well in internal assessments tend to maintain that performance in final exams, suggesting the importance of continuous evaluation.
|
||||||
|
|
||||||
|
## Recommendations
|
||||||
|
|
||||||
|
1. **Early Intervention**: Use Internal Test 1 scores to identify at-risk students early in the semester.
|
||||||
|
|
||||||
|
2. **Attendance Monitoring**: Implement strict attendance policies, as it significantly impacts final performance.
|
||||||
|
|
||||||
|
3. **Study Habits**: Encourage students to maintain at least 3 hours of daily study for optimal results.
|
||||||
|
|
||||||
|
4. **Assignment Completion**: While assignments show moderate correlation, they contribute to overall understanding and should be emphasized.
|
||||||
|
|
||||||
|
## Limitations
|
||||||
|
|
||||||
|
- The model assumes linear relationships between variables
|
||||||
|
|
||||||
|
- External factors (prior knowledge, aptitude, socioeconomic factors) are not captured
|
||||||
|
|
||||||
|
- The data represents a single cohort and may not generalize to all student populations
|
||||||
|
|
||||||
|
## Future Work
|
||||||
|
|
||||||
|
- Include demographic variables (age, gender, background)
|
||||||
|
|
||||||
|
- Analyze subject-wise performance patterns
|
||||||
|
|
||||||
|
- Develop predictive models for early warning systems
|
||||||
|
|
||||||
|
- Study the impact of teaching methodologies on performance
|
||||||
2132
Final.html
Normal file
2132
Final.html
Normal file
File diff suppressed because one or more lines are too long
2001
Final_Marks_Data.csv
Normal file
2001
Final_Marks_Data.csv
Normal file
File diff suppressed because it is too large
Load Diff
2
HW1.Rmd
2
HW1.Rmd
@@ -4,9 +4,9 @@ subtitle: "STAT3373"
|
|||||||
author: "Isaac Shoebottom"
|
author: "Isaac Shoebottom"
|
||||||
date: "Sept 18th, 2025"
|
date: "Sept 18th, 2025"
|
||||||
output:
|
output:
|
||||||
pdf_document: default
|
|
||||||
html_document:
|
html_document:
|
||||||
df_print: paged
|
df_print: paged
|
||||||
|
pdf_document: default
|
||||||
---
|
---
|
||||||
|
|
||||||
# Question 1
|
# Question 1
|
||||||
|
|||||||
2
HW2.Rmd
2
HW2.Rmd
@@ -4,9 +4,9 @@ subtitle: "STAT3373"
|
|||||||
author: "Isaac Shoebottom"
|
author: "Isaac Shoebottom"
|
||||||
date: "Sept 25th, 2025"
|
date: "Sept 25th, 2025"
|
||||||
output:
|
output:
|
||||||
pdf_document: default
|
|
||||||
html_document:
|
html_document:
|
||||||
df_print: paged
|
df_print: paged
|
||||||
|
pdf_document: default
|
||||||
---
|
---
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
|||||||
2
HW3.Rmd
2
HW3.Rmd
@@ -4,9 +4,9 @@ subtitle: "STAT3373"
|
|||||||
author: "Isaac Shoebottom"
|
author: "Isaac Shoebottom"
|
||||||
date: "Oct 2nd, 2025"
|
date: "Oct 2nd, 2025"
|
||||||
output:
|
output:
|
||||||
pdf_document: default
|
|
||||||
html_document:
|
html_document:
|
||||||
df_print: paged
|
df_print: paged
|
||||||
|
pdf_document: default
|
||||||
---
|
---
|
||||||
```{r message=FALSE, warning=FALSE}
|
```{r message=FALSE, warning=FALSE}
|
||||||
library(tidyverse)
|
library(tidyverse)
|
||||||
|
|||||||
2
HW4.Rmd
2
HW4.Rmd
@@ -4,9 +4,9 @@ subtitle: "STAT3373"
|
|||||||
author: "Isaac Shoebottom"
|
author: "Isaac Shoebottom"
|
||||||
date: "Oct 16th, 2025"
|
date: "Oct 16th, 2025"
|
||||||
output:
|
output:
|
||||||
pdf_document: default
|
|
||||||
html_document:
|
html_document:
|
||||||
df_print: paged
|
df_print: paged
|
||||||
|
pdf_document: default
|
||||||
---
|
---
|
||||||
|
|
||||||
```{r message=FALSE, warning=FALSE}
|
```{r message=FALSE, warning=FALSE}
|
||||||
|
|||||||
2
HW5.Rmd
2
HW5.Rmd
@@ -4,9 +4,9 @@ subtitle: "STAT3373"
|
|||||||
author: "Isaac Shoebottom"
|
author: "Isaac Shoebottom"
|
||||||
date: "Oct 23rd, 2025"
|
date: "Oct 23rd, 2025"
|
||||||
output:
|
output:
|
||||||
pdf_document: default
|
|
||||||
html_document:
|
html_document:
|
||||||
df_print: paged
|
df_print: paged
|
||||||
|
pdf_document: default
|
||||||
---
|
---
|
||||||
|
|
||||||
```{r message=FALSE, warning=FALSE}
|
```{r message=FALSE, warning=FALSE}
|
||||||
|
|||||||
2
HW6.Rmd
2
HW6.Rmd
@@ -4,9 +4,9 @@ subtitle: "STAT3373"
|
|||||||
author: "Isaac Shoebottom"
|
author: "Isaac Shoebottom"
|
||||||
date: "Oct 30th, 2025"
|
date: "Oct 30th, 2025"
|
||||||
output:
|
output:
|
||||||
pdf_document: default
|
|
||||||
html_document:
|
html_document:
|
||||||
df_print: paged
|
df_print: paged
|
||||||
|
pdf_document: default
|
||||||
---
|
---
|
||||||
|
|
||||||
```{r setup, include=FALSE}
|
```{r setup, include=FALSE}
|
||||||
|
|||||||
168
HW7.Rmd
Normal file
168
HW7.Rmd
Normal file
@@ -0,0 +1,168 @@
|
|||||||
|
---
|
||||||
|
title: "Assignment 7"
|
||||||
|
subtitle: "STAT3373"
|
||||||
|
author: "Isaac Shoebottom"
|
||||||
|
date: "Dec 4th, 2025"
|
||||||
|
output:
|
||||||
|
html_document:
|
||||||
|
df_print: paged
|
||||||
|
pdf_document: default
|
||||||
|
---
|
||||||
|
|
||||||
|
```{r setup, include=FALSE}
|
||||||
|
library(ggplot2)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Problem 1: Simple Linear Regression
|
||||||
|
|
||||||
|
### a) Scatter plot
|
||||||
|
|
||||||
|
```{r problem1a}
|
||||||
|
# Load the iris dataset
|
||||||
|
data(iris)
|
||||||
|
|
||||||
|
# Create scatter plot
|
||||||
|
plot(iris$Petal.Length, iris$Petal.Width,
|
||||||
|
xlab = "Petal Length (cm)",
|
||||||
|
ylab = "Petal Width (cm)",
|
||||||
|
main = "Relationship between Petal Length and Petal Width",
|
||||||
|
pch = 19,
|
||||||
|
col = "steelblue")
|
||||||
|
```
|
||||||
|
|
||||||
|
### b) Fit the model
|
||||||
|
|
||||||
|
```{r problem1b}
|
||||||
|
# Fit simple linear regression model
|
||||||
|
model1 <- lm(Petal.Width ~ Petal.Length, data = iris)
|
||||||
|
|
||||||
|
# Display model summary
|
||||||
|
summary(model1)
|
||||||
|
```
|
||||||
|
|
||||||
|
### c) Interpretation
|
||||||
|
|
||||||
|
- **Slope coefficient:** The slope is 0.4158. This means that for every 1 cm increase in petal length, petal width increases by approximately 0.416 cm, on average. The coefficient is highly statistically significant (p < 2e-16).
|
||||||
|
|
||||||
|
- **R-squared value:** The R² is 0.9271, meaning that 92.71% of the variance in petal width is explained by petal length. This indicates a very strong linear relationship between the two variables.
|
||||||
|
|
||||||
|
- **Statistical significance:** The F-statistic is 1882 with p < 2.2e-16, indicating the model is highly statistically significant. The predictor (Petal.Length) is also highly significant (p < 2e-16), meaning there is strong evidence of a linear relationship between petal length and width.
|
||||||
|
|
||||||
|
### d) Regression line plot
|
||||||
|
|
||||||
|
```{r problem1d}
|
||||||
|
# Create scatter plot with regression line and confidence bands
|
||||||
|
plot(iris$Petal.Length, iris$Petal.Width,
|
||||||
|
xlab = "Petal Length (cm)",
|
||||||
|
ylab = "Petal Width (cm)",
|
||||||
|
main = "Regression Line with 95% Confidence Bands",
|
||||||
|
pch = 19,
|
||||||
|
col = "steelblue")
|
||||||
|
|
||||||
|
# Add regression line
|
||||||
|
abline(model1, col = "red", lwd = 2)
|
||||||
|
|
||||||
|
# Add confidence bands
|
||||||
|
pred_data <- data.frame(Petal.Length = seq(min(iris$Petal.Length),
|
||||||
|
max(iris$Petal.Length),
|
||||||
|
length.out = 100))
|
||||||
|
conf_int <- predict(model1, newdata = pred_data, interval = "confidence")
|
||||||
|
|
||||||
|
lines(pred_data$Petal.Length, conf_int[, "lwr"], col = "darkgreen", lty = 2)
|
||||||
|
lines(pred_data$Petal.Length, conf_int[, "upr"], col = "darkgreen", lty = 2)
|
||||||
|
|
||||||
|
legend("topleft",
|
||||||
|
legend = c("Regression Line", "95% Confidence Bands"),
|
||||||
|
col = c("red", "darkgreen"),
|
||||||
|
lty = c(1, 2),
|
||||||
|
lwd = c(2, 1))
|
||||||
|
```
|
||||||
|
|
||||||
|
### e) Prediction
|
||||||
|
|
||||||
|
```{r problem1e}
|
||||||
|
# Predict petal width for petal length of 4.5 cm
|
||||||
|
new_data <- data.frame(Petal.Length = 4.5)
|
||||||
|
prediction <- predict(model1, newdata = new_data, interval = "prediction", level = 0.95)
|
||||||
|
|
||||||
|
print(prediction)
|
||||||
|
|
||||||
|
cat("\nPredicted petal width:", round(prediction[1], 3), "cm")
|
||||||
|
cat("\n95% Prediction Interval: [", round(prediction[2], 3), ",",
|
||||||
|
round(prediction[3], 3), "] cm")
|
||||||
|
```
|
||||||
|
|
||||||
|
For a flower with a petal length of 4.5 cm, we predict the petal width to be approximately 1.53 cm. We are 95% confident that the actual petal width for an individual flower with a petal length of 4.5 cm will fall between 1.12 cm and 1.94 cm.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Problem 2: Multiple Linear Regression
|
||||||
|
|
||||||
|
### a) Fit multiple regression model
|
||||||
|
|
||||||
|
```{r problem2a}
|
||||||
|
# Fit multiple linear regression model
|
||||||
|
model2 <- lm(Petal.Width ~ Petal.Length + Sepal.Length, data = iris)
|
||||||
|
|
||||||
|
# Display model summary
|
||||||
|
summary(model2)
|
||||||
|
```
|
||||||
|
|
||||||
|
### b) Model comparison
|
||||||
|
|
||||||
|
```{r problem2b}
|
||||||
|
# Compare models
|
||||||
|
cat("Simple Linear Regression (Model 1):\n")
|
||||||
|
cat("R-squared:", summary(model1)$r.squared, "\n")
|
||||||
|
cat("Adjusted R-squared:", summary(model1)$adj.r.squared, "\n")
|
||||||
|
cat("Residual Standard Error:", summary(model1)$sigma, "\n\n")
|
||||||
|
|
||||||
|
cat("Multiple Linear Regression (Model 2):\n")
|
||||||
|
cat("R-squared:", summary(model2)$r.squared, "\n")
|
||||||
|
cat("Adjusted R-squared:", summary(model2)$adj.r.squared, "\n")
|
||||||
|
cat("Residual Standard Error:", summary(model2)$sigma, "\n\n")
|
||||||
|
|
||||||
|
# ANOVA comparison
|
||||||
|
anova(model1, model2)
|
||||||
|
```
|
||||||
|
|
||||||
|
The multiple regression model (Model 2) fits the data better than the simple regression model (Model 1). Evidence for this includes:
|
||||||
|
|
||||||
|
1. **R-squared improvement:** Model 2 has R² = 0.9379 compared to Model 1's R² = 0.9271, explaining an additional 1.08% of variance in petal width.
|
||||||
|
|
||||||
|
2. **Adjusted R-squared:** Model 2's adjusted R² (0.9370) is higher than Model 1's (0.9266), accounting for the additional predictor.
|
||||||
|
|
||||||
|
3. **Residual Standard Error:** Model 2 has a lower RSE (0.1980) compared to Model 1 (0.2065), indicating better prediction accuracy.
|
||||||
|
|
||||||
|
4. **ANOVA F-test:** The ANOVA comparison shows that adding Sepal.Length significantly improves the model (p < 2.2e-16), indicating Model 2 is statistically significantly better than Model 1.
|
||||||
|
|
||||||
|
### c) Coefficient interpretation
|
||||||
|
|
||||||
|
In the multiple regression model, the coefficient for Petal.Length is 0.5279, which differs from the simple regression coefficient of 0.4158.
|
||||||
|
|
||||||
|
This difference occurs due to **confounding** and the **control of additional variables**. In the simple regression, the Petal.Length coefficient captures both its direct effect on Petal.Width and any indirect effects through its correlation with Sepal.Length.
|
||||||
|
|
||||||
|
In the multiple regression model, the Petal.Length coefficient (0.5279) represents the effect of petal length on petal width while **holding sepal length constant**. This partial effect is larger, suggesting that when we account for sepal length, the relationship between petal length and width is even stronger than it appeared in the simple model.
|
||||||
|
|
||||||
|
The Sepal.Length coefficient (-0.2091) is negative and significant, indicating that flowers with longer sepals tend to have narrower petals when petal length is held constant. This negative relationship was "hidden" in the simple regression model.
|
||||||
|
|
||||||
|
### d) Diagnostic plots
|
||||||
|
|
||||||
|
```{r problem2d}
|
||||||
|
# Create all four diagnostic plots
|
||||||
|
par(mfrow = c(2, 2))
|
||||||
|
plot(model2)
|
||||||
|
par(mfrow = c(1, 1))
|
||||||
|
```
|
||||||
|
|
||||||
|
Based on the diagnostic plots, the regression assumptions appear to be reasonably well met:
|
||||||
|
|
||||||
|
1. **Residuals vs Fitted (Linearity):** The plot shows a relatively random scatter around the horizontal line at zero, though there's a slight curved pattern. This suggests the linearity assumption is mostly satisfied but could potentially be improved.
|
||||||
|
|
||||||
|
2. **Q-Q Plot (Normality):** The points follow the diagonal line quite closely, with minor deviations in the tails. This indicates the residuals are approximately normally distributed, meeting the normality assumption adequately.
|
||||||
|
|
||||||
|
3. **Scale-Location (Homoscedasticity):** The points show relatively constant spread across fitted values, though there's slight fanning. The assumption of constant variance is reasonably met.
|
||||||
|
|
||||||
|
4. **Residuals vs Leverage (Influential points):** No points fall beyond Cook's distance contours (which aren't even visible), indicating there are no highly influential outliers that would unduly affect the regression results.
|
||||||
|
|
||||||
|
**Overall assessment:** The model assumptions are reasonably satisfied. The model appears appropriate for these data, though there may be minor non-linearity that could potentially be addressed with transformations or polynomial terms if needed for more precise predictions.
|
||||||
Reference in New Issue
Block a user