---
title: "Assignment 7"
subtitle: "STAT3373"
author: "Isaac Shoebottom"
date: "Dec 4th, 2025"
output:
  html_document:
    df_print: paged
  pdf_document: default
---

```{r setup, include=FALSE}
library(ggplot2)
```

## Problem 1: Simple Linear Regression

### a) Scatter plot

```{r problem1a}
# Load the iris dataset
data(iris)

# Create scatter plot
plot(iris$Petal.Length, iris$Petal.Width,
     xlab = "Petal Length (cm)",
     ylab = "Petal Width (cm)",
     main = "Relationship between Petal Length and Petal Width",
     pch = 19, col = "steelblue")
```

### b) Fit the model

```{r problem1b}
# Fit simple linear regression model
model1 <- lm(Petal.Width ~ Petal.Length, data = iris)

# Display model summary
summary(model1)
```

### c) Interpretation

- **Slope coefficient:** The slope is 0.4158, meaning that for every 1 cm increase in petal length, petal width increases by approximately 0.416 cm on average. The coefficient is highly statistically significant (p < 2e-16); an interval estimate for this slope is sketched at the end of this problem.
- **R-squared value:** The R² is 0.9271, so petal length explains 92.71% of the variance in petal width. This indicates a very strong linear relationship between the two variables.
- **Statistical significance:** The F-statistic is 1882 with p < 2.2e-16, so the model as a whole is highly statistically significant. The predictor (Petal.Length) is likewise highly significant (p < 2e-16), giving strong evidence of a linear relationship between petal length and petal width.

### d) Regression line plot

```{r problem1d}
# Create scatter plot with regression line and confidence bands
plot(iris$Petal.Length, iris$Petal.Width,
     xlab = "Petal Length (cm)",
     ylab = "Petal Width (cm)",
     main = "Regression Line with 95% Confidence Bands",
     pch = 19, col = "steelblue")

# Add regression line
abline(model1, col = "red", lwd = 2)

# Add confidence bands
pred_data <- data.frame(Petal.Length = seq(min(iris$Petal.Length),
                                           max(iris$Petal.Length),
                                           length.out = 100))
conf_int <- predict(model1, newdata = pred_data, interval = "confidence")
lines(pred_data$Petal.Length, conf_int[, "lwr"], col = "darkgreen", lty = 2)
lines(pred_data$Petal.Length, conf_int[, "upr"], col = "darkgreen", lty = 2)

legend("topleft",
       legend = c("Regression Line", "95% Confidence Bands"),
       col = c("red", "darkgreen"),
       lty = c(1, 2), lwd = c(2, 1))
```

### e) Prediction

```{r problem1e}
# Predict petal width for petal length of 4.5 cm
new_data <- data.frame(Petal.Length = 4.5)
prediction <- predict(model1, newdata = new_data,
                      interval = "prediction", level = 0.95)
print(prediction)

cat("\nPredicted petal width:", round(prediction[1], 3), "cm")
cat("\n95% Prediction Interval: [", round(prediction[2], 3), ",",
    round(prediction[3], 3), "] cm")
```

For a flower with a petal length of 4.5 cm, we predict the petal width to be approximately 1.51 cm. We are 95% confident that the actual petal width of an individual flower with a petal length of 4.5 cm will fall between roughly 1.10 cm and 1.92 cm.
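One distinction worth making explicit here: a *prediction* interval for an individual flower is wider than a *confidence* interval for the mean petal width at the same petal length, because it also accounts for flower-to-flower scatter around the regression line. Below is a small optional comparison at 4.5 cm; the chunk label is ours and not part of the assignment.

```{r problem1e-intervals}
# Confidence interval for the MEAN petal width at Petal.Length = 4.5
predict(model1, newdata = new_data, interval = "confidence", level = 0.95)

# Prediction interval for an INDIVIDUAL flower at Petal.Length = 4.5;
# the fitted value is the same, but the interval is noticeably wider
predict(model1, newdata = new_data, interval = "prediction", level = 0.95)
```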
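And, as promised in part (c), the slope can be reported with an interval estimate rather than a point estimate alone. A small optional check using base R's `confint()` (chunk label ours):

```{r problem1c-check}
# 95% confidence intervals for the intercept and slope of model1;
# the slope interval should be roughly 0.40 to 0.43, comfortably
# excluding zero, consistent with the tiny p-value reported above
confint(model1, level = 0.95)
```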
---

## Problem 2: Multiple Linear Regression

### a) Fit multiple regression model

```{r problem2a}
# Fit multiple linear regression model
model2 <- lm(Petal.Width ~ Petal.Length + Sepal.Length, data = iris)

# Display model summary
summary(model2)
```

### b) Model comparison

```{r problem2b}
# Compare models
cat("Simple Linear Regression (Model 1):\n")
cat("R-squared:", summary(model1)$r.squared, "\n")
cat("Adjusted R-squared:", summary(model1)$adj.r.squared, "\n")
cat("Residual Standard Error:", summary(model1)$sigma, "\n\n")

cat("Multiple Linear Regression (Model 2):\n")
cat("R-squared:", summary(model2)$r.squared, "\n")
cat("Adjusted R-squared:", summary(model2)$adj.r.squared, "\n")
cat("Residual Standard Error:", summary(model2)$sigma, "\n\n")

# ANOVA comparison
anova(model1, model2)
```

The multiple regression model (Model 2) fits the data somewhat better than the simple regression model (Model 1). Evidence for this includes:

1. **R-squared improvement:** Model 2 has R² = 0.9290 compared to Model 1's R² = 0.9271, explaining an additional 0.19 percentage points of the variance in petal width.
2. **Adjusted R-squared:** Model 2's adjusted R² (0.9281) is higher than Model 1's (0.9266), so the improvement holds even after penalizing the extra predictor.
3. **Residual Standard Error:** Model 2 has a slightly lower RSE (0.2045) than Model 1 (0.2065), indicating marginally better prediction accuracy.
4. **ANOVA F-test:** The ANOVA comparison shows that adding Sepal.Length improves the fit significantly at the 5% level (p ≈ 0.048). The improvement is statistically significant but modest; an AIC/BIC comparison is sketched at the end of this problem.

### c) Coefficient interpretation

In the multiple regression model, the coefficient for Petal.Length is 0.4494, which differs from the simple regression coefficient of 0.4158. The difference arises from **confounding**: in the simple regression, the Petal.Length coefficient absorbs both the direct association of petal length with petal width and any indirect association through its correlation with Sepal.Length.

In the multiple regression model, the Petal.Length coefficient (0.4494) represents the effect of petal length on petal width while **holding sepal length constant**. This partial effect is somewhat larger, suggesting that once sepal length is accounted for, the relationship between petal length and width is slightly stronger than it appeared in the simple model.

The Sepal.Length coefficient (-0.0822) is negative and marginally significant (p ≈ 0.048), indicating that, at a fixed petal length, flowers with longer sepals tend to have slightly narrower petals. This negative partial relationship was "hidden" in the simple regression model; it is driven by the strong correlation between the two predictors, checked at the end of this problem.

### d) Diagnostic plots

```{r problem2d}
# Create all four diagnostic plots
par(mfrow = c(2, 2))
plot(model2)
par(mfrow = c(1, 1))
```

Based on the diagnostic plots, the regression assumptions appear to be reasonably well met:

1. **Residuals vs Fitted (linearity):** The residuals scatter fairly randomly around the horizontal line at zero, apart from a slight curved pattern, so the linearity assumption is mostly satisfied but could potentially be improved.
2. **Q-Q plot (normality):** The points follow the diagonal line quite closely, with minor deviations in the tails, so the residuals are approximately normally distributed and the normality assumption is adequately met.
3. **Scale-Location (homoscedasticity):** The points show a relatively constant spread across fitted values, with only slight fanning, so the constant-variance assumption is reasonably met.
4. **Residuals vs Leverage (influential points):** No points fall beyond the Cook's distance contours, which do not even appear within the plotting region, so there are no highly influential outliers that would unduly affect the regression results.

**Overall assessment:** The model assumptions are reasonably satisfied and the model appears appropriate for these data, though the minor non-linearity could potentially be addressed with transformations or polynomial terms if more precise predictions were needed. Formal checks of normality and constant variance are sketched below.
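As a supplement to the model comparison in part (b), information criteria weigh fit against model complexity in a single number, with lower values preferred. A small optional check using base R's `AIC()` and `BIC()` (chunk label ours):

```{r problem2b-ic}
# Lower values are better. Because the improvement in fit is small,
# AIC should favour model2 only slightly, and BIC, with its heavier
# complexity penalty, may even lean towards model1
AIC(model1, model2)
BIC(model1, model2)
```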
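The coefficient shift discussed in part (c) rests on the two predictors being correlated with each other. A one-line check with base R's `cor()` (chunk label ours); for iris this correlation is strong, around 0.87:

```{r problem2c-cor}
# Correlation between the two predictors; a value near 0.87 explains
# why the Petal.Length coefficient changes once Sepal.Length enters
cor(iris$Petal.Length, iris$Sepal.Length)
```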
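Finally, the plot-based reading in part (d) is necessarily subjective, and formal tests can back it up. A small optional sketch: `shapiro.test()` is base R, while the Breusch-Pagan test assumes the `lmtest` package is installed (it is not loaded elsewhere in this document), so it is guarded with `requireNamespace()`:

```{r problem2d-tests}
# Shapiro-Wilk test of normality on the residuals; a small p-value
# would indicate a departure from normality
shapiro.test(residuals(model2))

# Breusch-Pagan test of constant error variance, run only if the
# optional lmtest package is available
if (requireNamespace("lmtest", quietly = TRUE)) {
  print(lmtest::bptest(model2))
}
```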