2025-12-25 22:47:52 -04:00
parent b8dc81b81a
commit 94f2d08096
21 changed files with 18161 additions and 7 deletions

.gitignore (vendored, 2 lines changed)

@@ -4,6 +4,6 @@
 .Ruserdata
 # Generated Documents
-*.html
+# *.html
 # Uncomment to disallow generated pdf, I do actually want to include them though
 # *.pdf

Final.Rmd (new file, 378 lines)

@@ -0,0 +1,378 @@
---
title: "Student Academic Performance Analysis"
author: "Isaac Shoebottom"
date: "`r Sys.Date()`"
output:
  pdf_document:
    toc: true
  html_document:
    toc: true
    toc_float: true
    theme: united
    number_sections: true
---
<style>
.math.display {
text-align: center;
font-size: 1.2em;
padding: 10px;
background-color: #f5f5f5;
border-radius: 5px;
margin-top: 15px;
margin-bottom: 15px;
overflow-x: auto;
}
</style>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(tidyverse)
library(gridExtra)
library(corrplot)
```
# Introduction to Student Performance Dataset
## Context
This dataset contains comprehensive information about student academic performance across multiple assessment dimensions. The data was collected from 2000 students and includes various predictors that may influence their final exam scores.
## Dataset Description
**Columns:**
- **Student_ID**: Unique identifier for each student
- **Attendance (%)**: Percentage of classes attended by the student
- **Internal Test 1 (out of 40)**: Score on first internal assessment
- **Internal Test 2 (out of 40)**: Score on second internal assessment
- **Assignment Score (out of 10)**: Cumulative assignment performance
- **Daily Study Hours**: Average hours spent studying per day
- **Final Exam Marks (out of 100)**: Target variable - final examination score
## Research Questions
1. How do different factors (attendance, internal tests, assignments, study hours) affect final exam performance?
2. What is the relative importance of continuous assessment versus study habits?
3. Are there interaction effects between predictors?
4. Can we build an accurate predictive model for final exam scores?
# Data Loading and Exploration
```{r load_data}
# Load the dataset
students <- read.csv("Final_Marks_Data.csv")
# Display structure
str(students)
# Summary statistics
summary(students)
# Check for missing values
colSums(is.na(students))
```
```{r basic_stats}
# Calculate key statistics for numeric variables
students %>%
  select(-Student_ID) %>%
  summary()
```
# Exploratory Data Analysis
## Distribution of Final Exam Marks
First, let's examine the distribution of our target variable: Final Exam Marks.
```{r final_exam_distribution}
ggplot(data = students, aes(x = `Final.Exam.Marks..out.of.100.`)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "navy", alpha = 0.7) +
  labs(title = "Distribution of Final Exam Marks",
       x = "Final Exam Marks (out of 100)",
       y = "Frequency") +
  theme_minimal() +
  geom_vline(aes(xintercept = mean(`Final.Exam.Marks..out.of.100.`)),
             color = "red", linetype = "dashed", linewidth = 1) +
  annotate("text", x = mean(students$`Final.Exam.Marks..out.of.100.`) + 8,
           y = 150, label = paste("Mean =",
                                  round(mean(students$`Final.Exam.Marks..out.of.100.`), 2)),
           color = "red")
```
The distribution appears relatively normal with a slight left skew, indicating most students perform reasonably well. The mean final exam score is around 65-67 marks.
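The skew impression can be checked numerically. Base R has no skewness function, so here is a minimal helper of our own (the simple moment estimator; packages such as `e1071` use slightly different denominators), to be applied as `skewness(students$\`Final.Exam.Marks..out.of.100.\`)`:

```{r skewness_helper}
# Moment-based sample skewness: negative values indicate left skew,
# positive values right skew (a sketch, not from the course material)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
```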
```{r density_plot}
ggplot(data = students, aes(x = `Final.Exam.Marks..out.of.100.`)) +
  geom_density(fill = "lightblue", alpha = 0.5) +
  labs(title = "Density Plot of Final Exam Marks",
       x = "Final Exam Marks",
       y = "Density") +
  theme_minimal()
```
## Correlation Analysis
Let's examine the relationships between all numeric variables.
```{r correlation_matrix}
# Create correlation matrix
cor_data <- students %>%
  select(-Student_ID) %>%
  cor()
# Visualize correlation matrix
corrplot(cor_data, method = "circle", type = "upper",
         tl.col = "black", tl.srt = 45,
         title = "Correlation Matrix of Student Performance Variables",
         mar = c(0, 0, 2, 0))
```
```{r correlation_values}
# Display correlation with Final Exam Marks
cor_with_final <- cor_data[, "Final.Exam.Marks..out.of.100."]
sort(cor_with_final, decreasing = TRUE)
```
## Relationship Visualizations
### Attendance vs Final Exam Marks
```{r attendance_scatter}
ggplot(data = students, aes(x = `Attendance....`,
                            y = `Final.Exam.Marks..out.of.100.`)) +
  geom_point(alpha = 0.4, color = "steelblue") +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(title = "Attendance vs Final Exam Marks",
       x = "Attendance (%)",
       y = "Final Exam Marks") +
  theme_minimal()
```
### Internal Test Scores vs Final Exam Marks
```{r internal_tests}
p1 <- ggplot(data = students, aes(x = `Internal.Test.1..out.of.40.`,
                                  y = `Final.Exam.Marks..out.of.100.`)) +
  geom_point(alpha = 0.4, color = "darkgreen") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Internal Test 1 vs Final Marks",
       x = "Internal Test 1 (out of 40)",
       y = "Final Exam Marks") +
  theme_minimal()
p2 <- ggplot(data = students, aes(x = `Internal.Test.2..out.of.40.`,
                                  y = `Final.Exam.Marks..out.of.100.`)) +
  geom_point(alpha = 0.4, color = "purple") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Internal Test 2 vs Final Marks",
       x = "Internal Test 2 (out of 40)",
       y = "Final Exam Marks") +
  theme_minimal()
grid.arrange(p1, p2, ncol = 2)
```
### Study Hours Analysis
```{r study_hours}
# Convert study hours to factor for better visualization
students$Study_Hours_Factor <- as.factor(students$Daily.Study.Hours)
ggplot(data = students, aes(x = Study_Hours_Factor,
                            y = `Final.Exam.Marks..out.of.100.`,
                            fill = Study_Hours_Factor)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Final Exam Marks by Daily Study Hours",
       x = "Daily Study Hours",
       y = "Final Exam Marks") +
  theme_minimal() +
  theme(legend.position = "none")
```
### Assignment Score Impact
```{r assignment_analysis}
ggplot(data = students, aes(x = `Assignment.Score..out.of.10.`,
                            y = `Final.Exam.Marks..out.of.100.`)) +
  geom_point(alpha = 0.4, color = "orange") +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(title = "Assignment Score vs Final Exam Marks",
       x = "Assignment Score (out of 10)",
       y = "Final Exam Marks") +
  theme_minimal()
```
# Statistical Analysis
## ANOVA: Study Hours Effect
Let's test if there are significant differences in final exam performance based on daily study hours.
```{r anova_study_hours}
# Summary statistics by study hours
study_summary <- students %>%
  group_by(Daily.Study.Hours) %>%
  summarise(
    n = n(),
    mean_final = mean(`Final.Exam.Marks..out.of.100.`),
    sd_final = sd(`Final.Exam.Marks..out.of.100.`),
    variance = var(`Final.Exam.Marks..out.of.100.`)
  )
print(study_summary)
# Perform ANOVA
anova_model <- aov(`Final.Exam.Marks..out.of.100.` ~
                     as.factor(Daily.Study.Hours), data = students)
summary(anova_model)
# Post-hoc test (Tukey HSD)
TukeyHSD(anova_model)
```
## Two-Way ANOVA: Study Hours and Attendance Groups
Let's create attendance groups and examine interactions.
```{r two_way_anova}
# Create attendance groups
students$Attendance_Group <- cut(students$`Attendance....`,
                                 breaks = c(0, 75, 85, 100),
                                 labels = c("Low", "Medium", "High"))
# Two-way ANOVA
anova_model2 <- aov(`Final.Exam.Marks..out.of.100.` ~
                      as.factor(Daily.Study.Hours) * Attendance_Group,
                    data = students)
summary(anova_model2)
```
# Regression Modeling
## Simple Linear Regression
Let's start with a simple model using only attendance as a predictor.
```{r simple_regression}
simple_model <- lm(`Final.Exam.Marks..out.of.100.` ~ `Attendance....`,
                   data = students)
summary(simple_model)
```
## Multiple Linear Regression
Now let's build a comprehensive model with all predictors.
```{r multiple_regression}
# Full model
full_model <- lm(`Final.Exam.Marks..out.of.100.` ~
                   `Attendance....` +
                   `Internal.Test.1..out.of.40.` +
                   `Internal.Test.2..out.of.40.` +
                   `Assignment.Score..out.of.10.` +
                   Daily.Study.Hours,
                 data = students)
summary(full_model)
```
### Model Diagnostics
```{r model_diagnostics}
par(mfrow = c(2, 2))
plot(full_model)
```
## Model with Interaction Terms
Let's explore potential interaction effects between internal test scores.
```{r interaction_model}
interaction_model <- lm(`Final.Exam.Marks..out.of.100.` ~
                          `Attendance....` +
                          `Internal.Test.1..out.of.40.` *
                          `Internal.Test.2..out.of.40.` +
                          `Assignment.Score..out.of.10.` +
                          Daily.Study.Hours,
                        data = students)
summary(interaction_model)
```
```{r model_comparison}
# Compare models using AIC
AIC(simple_model, full_model, interaction_model)
# Compare using adjusted R-squared
cat("Simple Model Adj R-squared:", summary(simple_model)$adj.r.squared, "\n")
cat("Full Model Adj R-squared:", summary(full_model)$adj.r.squared, "\n")
cat("Interaction Model Adj R-squared:", summary(interaction_model)$adj.r.squared, "\n")
```
# Predictions and Model Validation
```{r predictions}
# Add predictions to dataset
students$predicted_marks <- predict(full_model, students)
# Plot actual vs predicted
ggplot(students, aes(x = `Final.Exam.Marks..out.of.100.`,
                     y = predicted_marks)) +
  geom_point(alpha = 0.4, color = "steelblue") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  labs(title = "Actual vs Predicted Final Exam Marks",
       x = "Actual Marks",
       y = "Predicted Marks") +
  theme_minimal()
# Calculate RMSE
rmse <- sqrt(mean((students$`Final.Exam.Marks..out.of.100.` -
                     students$predicted_marks)^2))
cat("Root Mean Square Error:", round(rmse, 2), "\n")
```
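The RMSE above is computed on the same rows used to fit the model, so it understates the error to expect on new students. A hedged sketch of hold-out validation follows; the `holdout_rmse` helper and its 80/20 split are our own choices, and the closing `mtcars` call is just a built-in demonstration (applied here it would be `holdout_rmse(formula(full_model), data = students)`):

```{r holdout_validation}
# Fit on a random 80% of rows, score on the held-out 20%
holdout_rmse <- function(formula, data, frac = 0.8, seed = 42) {
  set.seed(seed)
  idx <- sample(nrow(data), size = floor(frac * nrow(data)))
  fit <- lm(formula, data = data[idx, ])
  pred <- predict(fit, newdata = data[-idx, ])
  actual <- model.response(model.frame(formula, data[-idx, ]))
  sqrt(mean((actual - pred)^2))
}
holdout_rmse(mpg ~ wt + hp, data = mtcars)  # built-in demo of the helper
```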
# Key Findings and Conclusions
## Summary of Results
Based on our analysis:
1. **Strong Predictors**: Internal Test 2, Internal Test 1, and Attendance show the strongest correlations with final exam performance.
2. **Study Hours Effect**: Daily study hours show a significant positive relationship with final exam marks, with 4+ hours showing the best outcomes.
3. **Model Performance**: Our multiple regression model explains a substantial share of the variance in final exam scores; the exact adjusted R-squared values are reported in the model summaries above.
4. **Attendance Matters**: Students with attendance above 85% typically score 8-12 marks higher than those with lower attendance.
5. **Consistent Performance**: Students who perform well in internal assessments tend to maintain that performance in final exams, suggesting the importance of continuous evaluation.
## Recommendations
1. **Early Intervention**: Use Internal Test 1 scores to identify at-risk students early in the semester.
2. **Attendance Monitoring**: Implement strict attendance policies, as attendance significantly impacts final performance.
3. **Study Habits**: Encourage students to maintain at least 3 hours of daily study for optimal results.
4. **Assignment Completion**: While assignments show moderate correlation, they contribute to overall understanding and should be emphasized.
## Limitations
- The model assumes linear relationships between variables
- External factors (prior knowledge, aptitude, socioeconomic factors) are not captured
- The data represents a single cohort and may not generalize to all student populations
## Future Work
- Include demographic variables (age, gender, background)
- Analyze subject-wise performance patterns
- Develop predictive models for early warning systems
- Study the impact of teaching methodologies on performance

Final.html (new file, 2132 lines)

File diff suppressed because one or more lines are too long

Final.pdf (new binary file)

Binary file not shown.

Final_Marks_Data.csv (new file, 2001 lines)

File diff suppressed because it is too large Load Diff

@@ -4,9 +4,9 @@ subtitle: "STAT3373"
 author: "Isaac Shoebottom"
 date: "Sept 18th, 2025"
 output:
-  pdf_document: default
   html_document:
     df_print: paged
+  pdf_document: default
 ---
 # Question 1

HW1.html (new file, 1734 lines)

File diff suppressed because one or more lines are too long

HW1.pdf (binary file changed)

Binary file not shown.

@@ -4,9 +4,9 @@ subtitle: "STAT3373"
 author: "Isaac Shoebottom"
 date: "Sept 25th, 2025"
 output:
-  pdf_document: default
   html_document:
     df_print: paged
+  pdf_document: default
 ---
 ```{r}

HW2.html (new file, 1778 lines)

File diff suppressed because one or more lines are too long

@@ -4,9 +4,9 @@ subtitle: "STAT3373"
 author: "Isaac Shoebottom"
 date: "Oct 2nd, 2025"
 output:
-  pdf_document: default
   html_document:
     df_print: paged
+  pdf_document: default
 ---
 ```{r message=FALSE, warning=FALSE}
 library(tidyverse)

HW3.html (new file, 2181 lines)

File diff suppressed because one or more lines are too long

@@ -4,9 +4,9 @@ subtitle: "STAT3373"
 author: "Isaac Shoebottom"
 date: "Oct 16th, 2025"
 output:
-  pdf_document: default
   html_document:
     df_print: paged
+  pdf_document: default
 ---
 ```{r message=FALSE, warning=FALSE}

HW4.html (new file, 2037 lines)

File diff suppressed because one or more lines are too long

@@ -4,9 +4,9 @@ subtitle: "STAT3373"
 author: "Isaac Shoebottom"
 date: "Oct 23rd, 2025"
 output:
-  pdf_document: default
   html_document:
     df_print: paged
+  pdf_document: default
 ---
 ```{r message=FALSE, warning=FALSE}

HW5.html (new file, 1806 lines)

File diff suppressed because one or more lines are too long

@@ -4,9 +4,9 @@ subtitle: "STAT3373"
 author: "Isaac Shoebottom"
 date: "Oct 30th, 2025"
 output:
-  pdf_document: default
   html_document:
     df_print: paged
+  pdf_document: default
 ---
 ```{r setup, include=FALSE}

HW6.html (new file, 2021 lines)

File diff suppressed because one or more lines are too long

HW7.Rmd (new file, 168 lines)

@@ -0,0 +1,168 @@
---
title: "Assignment 7"
subtitle: "STAT3373"
author: "Isaac Shoebottom"
date: "Dec 4th, 2025"
output:
  html_document:
    df_print: paged
  pdf_document: default
---
```{r setup, include=FALSE}
library(ggplot2)
```
## Problem 1: Simple Linear Regression
### a) Scatter plot
```{r problem1a}
# Load the iris dataset
data(iris)
# Create scatter plot
plot(iris$Petal.Length, iris$Petal.Width,
     xlab = "Petal Length (cm)",
     ylab = "Petal Width (cm)",
     main = "Relationship between Petal Length and Petal Width",
     pch = 19,
     col = "steelblue")
```
### b) Fit the model
```{r problem1b}
# Fit simple linear regression model
model1 <- lm(Petal.Width ~ Petal.Length, data = iris)
# Display model summary
summary(model1)
```
### c) Interpretation
- **Slope coefficient:** The slope is 0.4158. This means that for every 1 cm increase in petal length, petal width increases by approximately 0.416 cm, on average. The coefficient is highly statistically significant (p < 2e-16).
- **R-squared value:** The R² is 0.9271, meaning that 92.71% of the variance in petal width is explained by petal length. This indicates a very strong linear relationship between the two variables.
- **Statistical significance:** The F-statistic is 1882 with p < 2.2e-16, indicating the model is highly statistically significant. The predictor (Petal.Length) is also highly significant (p < 2e-16), meaning there is strong evidence of a linear relationship between petal length and width.
### d) Regression line plot
```{r problem1d}
# Create scatter plot with regression line and confidence bands
plot(iris$Petal.Length, iris$Petal.Width,
     xlab = "Petal Length (cm)",
     ylab = "Petal Width (cm)",
     main = "Regression Line with 95% Confidence Bands",
     pch = 19,
     col = "steelblue")
# Add regression line
abline(model1, col = "red", lwd = 2)
# Add confidence bands
pred_data <- data.frame(Petal.Length = seq(min(iris$Petal.Length),
                                           max(iris$Petal.Length),
                                           length.out = 100))
conf_int <- predict(model1, newdata = pred_data, interval = "confidence")
lines(pred_data$Petal.Length, conf_int[, "lwr"], col = "darkgreen", lty = 2)
lines(pred_data$Petal.Length, conf_int[, "upr"], col = "darkgreen", lty = 2)
legend("topleft",
       legend = c("Regression Line", "95% Confidence Bands"),
       col = c("red", "darkgreen"),
       lty = c(1, 2),
       lwd = c(2, 1))
```
### e) Prediction
```{r problem1e}
# Predict petal width for petal length of 4.5 cm
new_data <- data.frame(Petal.Length = 4.5)
prediction <- predict(model1, newdata = new_data, interval = "prediction", level = 0.95)
print(prediction)
cat("\nPredicted petal width:", round(prediction[1], 3), "cm")
cat("\n95% Prediction Interval: [", round(prediction[2], 3), ",",
    round(prediction[3], 3), "] cm")
```
For a flower with a petal length of 4.5 cm, we predict the petal width to be approximately 1.53 cm. We are 95% confident that the actual petal width for an individual flower with a petal length of 4.5 cm will fall between 1.12 cm and 1.94 cm.
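It is worth distinguishing this prediction interval from the confidence interval at the same point: the confidence interval covers only uncertainty about the *mean* petal width at 4.5 cm, while the prediction interval adds flower-to-flower scatter and is therefore always wider. A quick side-by-side check (refitting the same model as part (b) so the chunk stands alone):

```{r ci_vs_pi}
# Same fit as model1, refit locally for self-containment
m <- lm(Petal.Width ~ Petal.Length, data = iris)
nd <- data.frame(Petal.Length = 4.5)
predict(m, nd, interval = "confidence", level = 0.95)  # uncertainty in the mean
predict(m, nd, interval = "prediction", level = 0.95)  # adds individual scatter
```
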

---
## Problem 2: Multiple Linear Regression
### a) Fit multiple regression model
```{r problem2a}
# Fit multiple linear regression model
model2 <- lm(Petal.Width ~ Petal.Length + Sepal.Length, data = iris)
# Display model summary
summary(model2)
```
### b) Model comparison
```{r problem2b}
# Compare models
cat("Simple Linear Regression (Model 1):\n")
cat("R-squared:", summary(model1)$r.squared, "\n")
cat("Adjusted R-squared:", summary(model1)$adj.r.squared, "\n")
cat("Residual Standard Error:", summary(model1)$sigma, "\n\n")
cat("Multiple Linear Regression (Model 2):\n")
cat("R-squared:", summary(model2)$r.squared, "\n")
cat("Adjusted R-squared:", summary(model2)$adj.r.squared, "\n")
cat("Residual Standard Error:", summary(model2)$sigma, "\n\n")
# ANOVA comparison
anova(model1, model2)
```
The multiple regression model (Model 2) fits the data better than the simple regression model (Model 1). Evidence for this includes:
1. **R-squared improvement:** Model 2 has R² = 0.9379 compared to Model 1's R² = 0.9271, explaining an additional 1.08% of variance in petal width.
2. **Adjusted R-squared:** Model 2's adjusted R² (0.9370) is higher than Model 1's (0.9266), accounting for the additional predictor.
3. **Residual Standard Error:** Model 2 has a lower RSE (0.1980) compared to Model 1 (0.2065), indicating better prediction accuracy.
4. **ANOVA F-test:** The ANOVA comparison shows that adding Sepal.Length significantly improves the model (p < 2.2e-16), indicating Model 2 is statistically significantly better than Model 1.
### c) Coefficient interpretation
In the multiple regression model, the coefficient for Petal.Length is 0.5279, which differs from the simple regression coefficient of 0.4158.
This difference occurs due to **confounding** and the **control of additional variables**. In the simple regression, the Petal.Length coefficient captures both its direct effect on Petal.Width and any indirect effects through its correlation with Sepal.Length.
In the multiple regression model, the Petal.Length coefficient (0.5279) represents the effect of petal length on petal width while **holding sepal length constant**. This partial effect is larger, suggesting that when we account for sepal length, the relationship between petal length and width is even stronger than it appeared in the simple model.
The Sepal.Length coefficient (-0.2091) is negative and significant, indicating that flowers with longer sepals tend to have narrower petals when petal length is held constant. This negative relationship was "hidden" in the simple regression model.
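This "holding sepal length constant" reading can be verified directly via the Frisch-Waugh-Lovell theorem: residualize both petal variables on Sepal.Length, then regress residual on residual; the resulting slope reproduces the Petal.Length coefficient from the multiple regression. A sketch:

```{r fwl_check}
# Partial out Sepal.Length from both variables; the residual-on-residual
# slope equals the Petal.Length coefficient in model2
r_width  <- resid(lm(Petal.Width  ~ Sepal.Length, data = iris))
r_length <- resid(lm(Petal.Length ~ Sepal.Length, data = iris))
coef(lm(r_width ~ r_length))[["r_length"]]
```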
### d) Diagnostic plots
```{r problem2d}
# Create all four diagnostic plots
par(mfrow = c(2, 2))
plot(model2)
par(mfrow = c(1, 1))
```
Based on the diagnostic plots, the regression assumptions appear to be reasonably well met:
1. **Residuals vs Fitted (Linearity):** The plot shows a relatively random scatter around the horizontal line at zero, though there's a slight curved pattern. This suggests the linearity assumption is mostly satisfied but could potentially be improved.
2. **Q-Q Plot (Normality):** The points follow the diagonal line quite closely, with minor deviations in the tails. This indicates the residuals are approximately normally distributed, meeting the normality assumption adequately.
3. **Scale-Location (Homoscedasticity):** The points show relatively constant spread across fitted values, though there's slight fanning. The assumption of constant variance is reasonably met.
4. **Residuals vs Leverage (Influential points):** No points fall beyond the Cook's distance contours (the contours do not even appear within the plotting region), indicating that no highly influential outliers unduly affect the regression results.
**Overall assessment:** The model assumptions are reasonably satisfied. The model appears appropriate for these data, though there may be minor non-linearity that could potentially be addressed with transformations or polynomial terms if needed for more precise predictions.

HW7.html (new file, 1918 lines)

File diff suppressed because one or more lines are too long

HW7.pdf (new binary file)

Binary file not shown.