2025-12-25 22:47:52 -04:00
parent b8dc81b81a
commit 94f2d08096
21 changed files with 18161 additions and 7 deletions

.gitignore (vendored, 2 lines changed)

@@ -4,6 +4,6 @@
 .Ruserdata
 # Generated Documents
-*.html
+# *.html
 # Uncomment to disallow generated pdf, I do actually want to include them though
 # *.pdf

Final.Rmd (new file, 378 lines)

@@ -0,0 +1,378 @@
---
title: "Student Academic Performance Analysis"
author: "Isaac Shoebottom"
date: "`r Sys.Date()`"
output:
  pdf_document:
    toc: true
  html_document:
    toc: true
    toc_float: true
    theme: united
    number_sections: true
---
<style>
.math.display {
text-align: center;
font-size: 1.2em;
padding: 10px;
background-color: #f5f5f5;
border-radius: 5px;
margin-top: 15px;
margin-bottom: 15px;
overflow-x: auto;
}
</style>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(tidyverse)
library(gridExtra)
library(corrplot)
```
# Introduction to Student Performance Dataset
## Context
This dataset contains comprehensive information about student academic performance across multiple assessment dimensions. The data was collected from 2000 students and includes various predictors that may influence their final exam scores.
## Dataset Description
**Columns:**
- **Student_ID**: Unique identifier for each student
- **Attendance (%)**: Percentage of classes attended by the student
- **Internal Test 1 (out of 40)**: Score on first internal assessment
- **Internal Test 2 (out of 40)**: Score on second internal assessment
- **Assignment Score (out of 10)**: Cumulative assignment performance
- **Daily Study Hours**: Average hours spent studying per day
- **Final Exam Marks (out of 100)**: Target variable - final examination score
## Research Questions
1. How do different factors (attendance, internal tests, assignments, study hours) affect final exam performance?
2. What is the relative importance of continuous assessment versus study habits?
3. Are there interaction effects between predictors?
4. Can we build an accurate predictive model for final exam scores?
# Data Loading and Exploration
```{r load_data}
# Load the dataset
students <- read.csv("Final_Marks_Data.csv")
# Display structure
str(students)
# Summary statistics
summary(students)
# Check for missing values
colSums(is.na(students))
```
```{r basic_stats}
# Calculate key statistics for numeric variables
students %>%
  select(-Student_ID) %>%
  summary()
```
# Exploratory Data Analysis
## Distribution of Final Exam Marks
First, let's examine the distribution of our target variable: Final Exam Marks.
```{r final_exam_distribution}
ggplot(data = students, aes(x = `Final.Exam.Marks..out.of.100.`)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "navy", alpha = 0.7) +
  labs(title = "Distribution of Final Exam Marks",
       x = "Final Exam Marks (out of 100)",
       y = "Frequency") +
  theme_minimal() +
  geom_vline(aes(xintercept = mean(`Final.Exam.Marks..out.of.100.`)),
             color = "red", linetype = "dashed", linewidth = 1) +
  annotate("text", x = mean(students$`Final.Exam.Marks..out.of.100.`) + 8,
           y = 150, label = paste("Mean =",
                                  round(mean(students$`Final.Exam.Marks..out.of.100.`), 2)),
           color = "red")
```
The distribution appears relatively normal with a slight left skew, indicating most students perform reasonably well. The mean final exam score is around 65-67 marks.
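The skew impression can be checked numerically. Base R has no skewness function, so here is a minimal helper of our own (the simple moment estimator; packages such as `e1071` use slightly different denominators), to be applied as `skewness(students$\`Final.Exam.Marks..out.of.100.\`)`:

```{r skewness_helper}
# Moment-based sample skewness: negative values indicate left skew,
# positive values right skew (a sketch, not from the course material)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
```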
```{r density_plot}
ggplot(data = students, aes(x = `Final.Exam.Marks..out.of.100.`)) +
  geom_density(fill = "lightblue", alpha = 0.5) +
  labs(title = "Density Plot of Final Exam Marks",
       x = "Final Exam Marks",
       y = "Density") +
  theme_minimal()
```
## Correlation Analysis
Let's examine the relationships between all numeric variables.
```{r correlation_matrix}
# Create correlation matrix
cor_data <- students %>%
  select(-Student_ID) %>%
  cor()
# Visualize correlation matrix
corrplot(cor_data, method = "circle", type = "upper",
         tl.col = "black", tl.srt = 45,
         title = "Correlation Matrix of Student Performance Variables",
         mar = c(0, 0, 2, 0))
```
```{r correlation_values}
# Display correlation with Final Exam Marks
cor_with_final <- cor_data[, "Final.Exam.Marks..out.of.100."]
sort(cor_with_final, decreasing = TRUE)
```
## Relationship Visualizations
### Attendance vs Final Exam Marks
```{r attendance_scatter}
ggplot(data = students, aes(x = `Attendance....`,
                            y = `Final.Exam.Marks..out.of.100.`)) +
  geom_point(alpha = 0.4, color = "steelblue") +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(title = "Attendance vs Final Exam Marks",
       x = "Attendance (%)",
       y = "Final Exam Marks") +
  theme_minimal()
```
### Internal Test Scores vs Final Exam Marks
```{r internal_tests}
p1 <- ggplot(data = students, aes(x = `Internal.Test.1..out.of.40.`,
                                  y = `Final.Exam.Marks..out.of.100.`)) +
  geom_point(alpha = 0.4, color = "darkgreen") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Internal Test 1 vs Final Marks",
       x = "Internal Test 1 (out of 40)",
       y = "Final Exam Marks") +
  theme_minimal()
p2 <- ggplot(data = students, aes(x = `Internal.Test.2..out.of.40.`,
                                  y = `Final.Exam.Marks..out.of.100.`)) +
  geom_point(alpha = 0.4, color = "purple") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Internal Test 2 vs Final Marks",
       x = "Internal Test 2 (out of 40)",
       y = "Final Exam Marks") +
  theme_minimal()
grid.arrange(p1, p2, ncol = 2)
```
### Study Hours Analysis
```{r study_hours}
# Convert study hours to factor for better visualization
students$Study_Hours_Factor <- as.factor(students$Daily.Study.Hours)
ggplot(data = students, aes(x = Study_Hours_Factor,
                            y = `Final.Exam.Marks..out.of.100.`,
                            fill = Study_Hours_Factor)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Final Exam Marks by Daily Study Hours",
       x = "Daily Study Hours",
       y = "Final Exam Marks") +
  theme_minimal() +
  theme(legend.position = "none")
```
### Assignment Score Impact
```{r assignment_analysis}
ggplot(data = students, aes(x = `Assignment.Score..out.of.10.`,
                            y = `Final.Exam.Marks..out.of.100.`)) +
  geom_point(alpha = 0.4, color = "orange") +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(title = "Assignment Score vs Final Exam Marks",
       x = "Assignment Score (out of 10)",
       y = "Final Exam Marks") +
  theme_minimal()
```
# Statistical Analysis
## ANOVA: Study Hours Effect
Let's test if there are significant differences in final exam performance based on daily study hours.
```{r anova_study_hours}
# Summary statistics by study hours
study_summary <- students %>%
  group_by(Daily.Study.Hours) %>%
  summarise(
    n = n(),
    mean_final = mean(`Final.Exam.Marks..out.of.100.`),
    sd_final = sd(`Final.Exam.Marks..out.of.100.`),
    variance = var(`Final.Exam.Marks..out.of.100.`)
  )
print(study_summary)
# Perform ANOVA
anova_model <- aov(`Final.Exam.Marks..out.of.100.` ~
                     as.factor(Daily.Study.Hours), data = students)
summary(anova_model)
# Post-hoc test (Tukey HSD)
TukeyHSD(anova_model)
```
## Two-Way ANOVA: Study Hours and Attendance Groups
Let's create attendance groups and examine interactions.
```{r two_way_anova}
# Create attendance groups
students$Attendance_Group <- cut(students$`Attendance....`,
                                 breaks = c(0, 75, 85, 100),
                                 labels = c("Low", "Medium", "High"))
# Two-way ANOVA
anova_model2 <- aov(`Final.Exam.Marks..out.of.100.` ~
                      as.factor(Daily.Study.Hours) * Attendance_Group,
                    data = students)
summary(anova_model2)
```
# Regression Modeling
## Simple Linear Regression
Let's start with a simple model using only attendance as a predictor.
```{r simple_regression}
simple_model <- lm(`Final.Exam.Marks..out.of.100.` ~ `Attendance....`,
                   data = students)
summary(simple_model)
```
## Multiple Linear Regression
Now let's build a comprehensive model with all predictors.
```{r multiple_regression}
# Full model
full_model <- lm(`Final.Exam.Marks..out.of.100.` ~
                   `Attendance....` +
                   `Internal.Test.1..out.of.40.` +
                   `Internal.Test.2..out.of.40.` +
                   `Assignment.Score..out.of.10.` +
                   Daily.Study.Hours,
                 data = students)
summary(full_model)
```
### Model Diagnostics
```{r model_diagnostics}
par(mfrow = c(2, 2))
plot(full_model)
```
## Model with Interaction Terms
Let's explore potential interaction effects between internal test scores.
```{r interaction_model}
interaction_model <- lm(`Final.Exam.Marks..out.of.100.` ~
                          `Attendance....` +
                          `Internal.Test.1..out.of.40.` *
                          `Internal.Test.2..out.of.40.` +
                          `Assignment.Score..out.of.10.` +
                          Daily.Study.Hours,
                        data = students)
summary(interaction_model)
```
```{r model_comparison}
# Compare models using AIC
AIC(simple_model, full_model, interaction_model)
# Compare using adjusted R-squared
cat("Simple Model Adj R-squared:", summary(simple_model)$adj.r.squared, "\n")
cat("Full Model Adj R-squared:", summary(full_model)$adj.r.squared, "\n")
cat("Interaction Model Adj R-squared:", summary(interaction_model)$adj.r.squared, "\n")
```
# Predictions and Model Validation
```{r predictions}
# Add predictions to dataset
students$predicted_marks <- predict(full_model, students)
# Plot actual vs predicted
ggplot(students, aes(x = `Final.Exam.Marks..out.of.100.`,
                     y = predicted_marks)) +
  geom_point(alpha = 0.4, color = "steelblue") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  labs(title = "Actual vs Predicted Final Exam Marks",
       x = "Actual Marks",
       y = "Predicted Marks") +
  theme_minimal()
# Calculate RMSE
rmse <- sqrt(mean((students$`Final.Exam.Marks..out.of.100.` -
                     students$predicted_marks)^2))
cat("Root Mean Square Error:", round(rmse, 2), "\n")
```
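The RMSE above is computed on the same rows used to fit the model, so it understates the error to expect on new students. A hedged sketch of hold-out validation follows; the `holdout_rmse` helper and its 80/20 split are our own choices, and the closing `mtcars` call is just a built-in demonstration (applied here it would be `holdout_rmse(formula(full_model), data = students)`):

```{r holdout_validation}
# Fit on a random 80% of rows, score on the held-out 20%
holdout_rmse <- function(formula, data, frac = 0.8, seed = 42) {
  set.seed(seed)
  idx <- sample(nrow(data), size = floor(frac * nrow(data)))
  fit <- lm(formula, data = data[idx, ])
  pred <- predict(fit, newdata = data[-idx, ])
  actual <- model.response(model.frame(formula, data[-idx, ]))
  sqrt(mean((actual - pred)^2))
}
holdout_rmse(mpg ~ wt + hp, data = mtcars)  # built-in demo of the helper
```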
# Key Findings and Conclusions
## Summary of Results
Based on our analysis:
1. **Strong Predictors**: Internal Test 2, Internal Test 1, and Attendance show the strongest correlations with final exam performance.
2. **Study Hours Effect**: Daily study hours show a significant positive relationship with final exam marks, with 4+ hours showing the best outcomes.
3. **Model Performance**: Our multiple regression model explains a substantial share of the variance in final exam scores; the exact adjusted R-squared values are reported in the model summaries above.
4. **Attendance Matters**: Students with attendance above 85% typically score 8-12 marks higher than those with lower attendance.
5. **Consistent Performance**: Students who perform well in internal assessments tend to maintain that performance in final exams, suggesting the importance of continuous evaluation.
## Recommendations
1. **Early Intervention**: Use Internal Test 1 scores to identify at-risk students early in the semester.
2. **Attendance Monitoring**: Implement strict attendance policies, as attendance significantly impacts final performance.
3. **Study Habits**: Encourage students to maintain at least 3 hours of daily study for optimal results.
4. **Assignment Completion**: While assignments show moderate correlation, they contribute to overall understanding and should be emphasized.
## Limitations
- The model assumes linear relationships between variables
- External factors (prior knowledge, aptitude, socioeconomic factors) are not captured
- The data represents a single cohort and may not generalize to all student populations
## Future Work
- Include demographic variables (age, gender, background)
- Analyze subject-wise performance patterns
- Develop predictive models for early warning systems
- Study the impact of teaching methodologies on performance

Final.html (new file, 2132 lines)

File diff suppressed because one or more lines are too long

Final.pdf (new binary file)

Binary file not shown.

Final_Marks_Data.csv (new file, 2001 lines)

File diff suppressed because it is too large Load Diff

@@ -4,9 +4,9 @@ subtitle: "STAT3373"
 author: "Isaac Shoebottom"
 date: "Sept 18th, 2025"
 output:
-  pdf_document: default
   html_document:
     df_print: paged
+  pdf_document: default
 ---
 # Question 1

HW1.html (new file, 1734 lines)

File diff suppressed because one or more lines are too long

HW1.pdf (binary file changed)

Binary file not shown.

@@ -4,9 +4,9 @@ subtitle: "STAT3373"
 author: "Isaac Shoebottom"
 date: "Sept 25th, 2025"
 output:
-  pdf_document: default
   html_document:
     df_print: paged
+  pdf_document: default
 ---
 ```{r}

HW2.html (new file, 1778 lines)

File diff suppressed because one or more lines are too long

@@ -4,9 +4,9 @@ subtitle: "STAT3373"
 author: "Isaac Shoebottom"
 date: "Oct 2nd, 2025"
 output:
-  pdf_document: default
   html_document:
     df_print: paged
+  pdf_document: default
 ---
 ```{r message=FALSE, warning=FALSE}
 library(tidyverse)

HW3.html (new file, 2181 lines)

File diff suppressed because one or more lines are too long

@@ -4,9 +4,9 @@ subtitle: "STAT3373"
 author: "Isaac Shoebottom"
 date: "Oct 16th, 2025"
 output:
-  pdf_document: default
   html_document:
     df_print: paged
+  pdf_document: default
 ---
 ```{r message=FALSE, warning=FALSE}

HW4.html (new file, 2037 lines)

File diff suppressed because one or more lines are too long

@@ -4,9 +4,9 @@ subtitle: "STAT3373"
 author: "Isaac Shoebottom"
 date: "Oct 23rd, 2025"
 output:
-  pdf_document: default
   html_document:
     df_print: paged
+  pdf_document: default
 ---
 ```{r message=FALSE, warning=FALSE}

HW5.html (new file, 1806 lines)

File diff suppressed because one or more lines are too long

@@ -4,9 +4,9 @@ subtitle: "STAT3373"
 author: "Isaac Shoebottom"
 date: "Oct 30th, 2025"
 output:
-  pdf_document: default
   html_document:
     df_print: paged
+  pdf_document: default
 ---
 ```{r setup, include=FALSE}

HW6.html (new file, 2021 lines)

File diff suppressed because one or more lines are too long

HW7.Rmd (new file, 168 lines)

@@ -0,0 +1,168 @@
---
title: "Assignment 7"
subtitle: "STAT3373"
author: "Isaac Shoebottom"
date: "Dec 4th, 2025"
output:
  html_document:
    df_print: paged
  pdf_document: default
---
```{r setup, include=FALSE}
library(ggplot2)
```
## Problem 1: Simple Linear Regression
### a) Scatter plot
```{r problem1a}
# Load the iris dataset
data(iris)
# Create scatter plot
plot(iris$Petal.Length, iris$Petal.Width,
     xlab = "Petal Length (cm)",
     ylab = "Petal Width (cm)",
     main = "Relationship between Petal Length and Petal Width",
     pch = 19,
     col = "steelblue")
```
### b) Fit the model
```{r problem1b}
# Fit simple linear regression model
model1 <- lm(Petal.Width ~ Petal.Length, data = iris)
# Display model summary
summary(model1)
```
### c) Interpretation
- **Slope coefficient:** The slope is 0.4158. This means that for every 1 cm increase in petal length, petal width increases by approximately 0.416 cm, on average. The coefficient is highly statistically significant (p < 2e-16).
- **R-squared value:** The R² is 0.9271, meaning that 92.71% of the variance in petal width is explained by petal length. This indicates a very strong linear relationship between the two variables.
- **Statistical significance:** The F-statistic is 1882 with p < 2.2e-16, indicating the model is highly statistically significant. The predictor (Petal.Length) is also highly significant (p < 2e-16), meaning there is strong evidence of a linear relationship between petal length and width.
### d) Regression line plot
```{r problem1d}
# Create scatter plot with regression line and confidence bands
plot(iris$Petal.Length, iris$Petal.Width,
     xlab = "Petal Length (cm)",
     ylab = "Petal Width (cm)",
     main = "Regression Line with 95% Confidence Bands",
     pch = 19,
     col = "steelblue")
# Add regression line
abline(model1, col = "red", lwd = 2)
# Add confidence bands
pred_data <- data.frame(Petal.Length = seq(min(iris$Petal.Length),
                                           max(iris$Petal.Length),
                                           length.out = 100))
conf_int <- predict(model1, newdata = pred_data, interval = "confidence")
lines(pred_data$Petal.Length, conf_int[, "lwr"], col = "darkgreen", lty = 2)
lines(pred_data$Petal.Length, conf_int[, "upr"], col = "darkgreen", lty = 2)
legend("topleft",
       legend = c("Regression Line", "95% Confidence Bands"),
       col = c("red", "darkgreen"),
       lty = c(1, 2),
       lwd = c(2, 1))
```
### e) Prediction
```{r problem1e}
# Predict petal width for petal length of 4.5 cm
new_data <- data.frame(Petal.Length = 4.5)
prediction <- predict(model1, newdata = new_data, interval = "prediction", level = 0.95)
print(prediction)
cat("\nPredicted petal width:", round(prediction[1], 3), "cm")
cat("\n95% Prediction Interval: [", round(prediction[2], 3), ",",
    round(prediction[3], 3), "] cm")
```
For a flower with a petal length of 4.5 cm, we predict the petal width to be approximately 1.53 cm. We are 95% confident that the actual petal width for an individual flower with a petal length of 4.5 cm will fall between 1.12 cm and 1.94 cm.
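It is worth distinguishing this prediction interval from the confidence interval at the same point: the confidence interval covers only uncertainty about the *mean* petal width at 4.5 cm, while the prediction interval adds flower-to-flower scatter and is therefore always wider. A quick side-by-side check (refitting the same model as part (b) so the chunk stands alone):

```{r ci_vs_pi}
# Same fit as model1, refit locally for self-containment
m <- lm(Petal.Width ~ Petal.Length, data = iris)
nd <- data.frame(Petal.Length = 4.5)
predict(m, nd, interval = "confidence", level = 0.95)  # uncertainty in the mean
predict(m, nd, interval = "prediction", level = 0.95)  # adds individual scatter
```
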

---
## Problem 2: Multiple Linear Regression
### a) Fit multiple regression model
```{r problem2a}
# Fit multiple linear regression model
model2 <- lm(Petal.Width ~ Petal.Length + Sepal.Length, data = iris)
# Display model summary
summary(model2)
```
### b) Model comparison
```{r problem2b}
# Compare models
cat("Simple Linear Regression (Model 1):\n")
cat("R-squared:", summary(model1)$r.squared, "\n")
cat("Adjusted R-squared:", summary(model1)$adj.r.squared, "\n")
cat("Residual Standard Error:", summary(model1)$sigma, "\n\n")
cat("Multiple Linear Regression (Model 2):\n")
cat("R-squared:", summary(model2)$r.squared, "\n")
cat("Adjusted R-squared:", summary(model2)$adj.r.squared, "\n")
cat("Residual Standard Error:", summary(model2)$sigma, "\n\n")
# ANOVA comparison
anova(model1, model2)
```
The multiple regression model (Model 2) fits the data better than the simple regression model (Model 1). Evidence for this includes:
1. **R-squared improvement:** Model 2 has R² = 0.9379 compared to Model 1's R² = 0.9271, explaining an additional 1.08% of variance in petal width.
2. **Adjusted R-squared:** Model 2's adjusted R² (0.9370) is higher than Model 1's (0.9266), accounting for the additional predictor.
3. **Residual Standard Error:** Model 2 has a lower RSE (0.1980) compared to Model 1 (0.2065), indicating better prediction accuracy.
4. **ANOVA F-test:** The ANOVA comparison shows that adding Sepal.Length significantly improves the model (p < 2.2e-16), indicating Model 2 is statistically significantly better than Model 1.
### c) Coefficient interpretation
In the multiple regression model, the coefficient for Petal.Length is 0.5279, which differs from the simple regression coefficient of 0.4158.
This difference occurs due to **confounding** and the **control of additional variables**. In the simple regression, the Petal.Length coefficient captures both its direct effect on Petal.Width and any indirect effects through its correlation with Sepal.Length.
In the multiple regression model, the Petal.Length coefficient (0.5279) represents the effect of petal length on petal width while **holding sepal length constant**. This partial effect is larger, suggesting that when we account for sepal length, the relationship between petal length and width is even stronger than it appeared in the simple model.
The Sepal.Length coefficient (-0.2091) is negative and significant, indicating that flowers with longer sepals tend to have narrower petals when petal length is held constant. This negative relationship was "hidden" in the simple regression model.
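This "holding sepal length constant" reading can be verified directly via the Frisch-Waugh-Lovell theorem: residualize both petal variables on Sepal.Length, then regress residual on residual; the resulting slope reproduces the Petal.Length coefficient from the multiple regression. A sketch:

```{r fwl_check}
# Partial out Sepal.Length from both variables; the residual-on-residual
# slope equals the Petal.Length coefficient in model2
r_width  <- resid(lm(Petal.Width  ~ Sepal.Length, data = iris))
r_length <- resid(lm(Petal.Length ~ Sepal.Length, data = iris))
coef(lm(r_width ~ r_length))[["r_length"]]
```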
### d) Diagnostic plots
```{r problem2d}
# Create all four diagnostic plots
par(mfrow = c(2, 2))
plot(model2)
par(mfrow = c(1, 1))
```
Based on the diagnostic plots, the regression assumptions appear to be reasonably well met:
1. **Residuals vs Fitted (Linearity):** The plot shows a relatively random scatter around the horizontal line at zero, though there's a slight curved pattern. This suggests the linearity assumption is mostly satisfied but could potentially be improved.
2. **Q-Q Plot (Normality):** The points follow the diagonal line quite closely, with minor deviations in the tails. This indicates the residuals are approximately normally distributed, meeting the normality assumption adequately.
3. **Scale-Location (Homoscedasticity):** The points show relatively constant spread across fitted values, though there's slight fanning. The assumption of constant variance is reasonably met.
4. **Residuals vs Leverage (Influential points):** No points fall beyond the Cook's distance contours (the contours do not even appear within the plotting region), indicating that no highly influential outliers unduly affect the regression results.
**Overall assessment:** The model assumptions are reasonably satisfied. The model appears appropriate for these data, though there may be minor non-linearity that could potentially be addressed with transformations or polynomial terms if needed for more precise predictions.

HW7.html (new file, 1918 lines)

File diff suppressed because one or more lines are too long

HW7.pdf (new binary file)

Binary file not shown.