This dataset contains comprehensive information about student academic performance across multiple assessment dimensions. The data was collected from 2000 students and includes various predictors that may influence their final exam scores.
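Columns:
Student_ID: student identifier (character)
Attendance: attendance percentage
Internal Test 1: score out of 40
Internal Test 2: score out of 40
Assignment Score: score out of 10
Daily Study Hours: hours of study per day (integer)
Final Exam Marks: score out of 100 (the target variable)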
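# Load the packages used in this report
library(dplyr)      # %>%, select(), group_by(), summarise()
library(ggplot2)    # plots
library(corrplot)   # corrplot()
library(gridExtra)  # grid.arrange()
# Load the dataset
students <- read.csv("Final_Marks_Data.csv")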
# Display structure
str(students)
## 'data.frame': 2000 obs. of 7 variables:
## $ Student_ID : chr "S1000" "S1001" "S1002" "S1003" ...
## $ Attendance.... : int 84 91 73 80 84 100 96 83 91 87 ...
## $ Internal.Test.1..out.of.40. : int 30 24 29 36 31 34 40 39 30 27 ...
## $ Internal.Test.2..out.of.40. : int 36 38 26 35 37 34 36 37 37 37 ...
## $ Assignment.Score..out.of.10. : int 7 6 7 7 8 7 8 7 8 8 ...
## $ Daily.Study.Hours : int 3 3 3 3 3 3 3 3 2 3 ...
## $ Final.Exam.Marks..out.of.100.: int 72 56 56 74 66 79 83 77 71 61 ...
# Summary statistics
summary(students)
## Student_ID Attendance.... Internal.Test.1..out.of.40.
## Length:2000 Min. : 52.00 Min. :18.00
## Class :character 1st Qu.: 80.00 1st Qu.:29.00
## Mode :character Median : 85.00 Median :32.00
## Mean : 84.89 Mean :32.12
## 3rd Qu.: 90.00 3rd Qu.:35.00
## Max. :100.00 Max. :40.00
## Internal.Test.2..out.of.40. Assignment.Score..out.of.10. Daily.Study.Hours
## Min. :16.00 Min. : 4.000 Min. :1.000
## 1st Qu.:29.00 1st Qu.: 7.000 1st Qu.:2.000
## Median :33.00 Median : 8.000 Median :3.000
## Mean :32.46 Mean : 7.507 Mean :2.824
## 3rd Qu.:36.00 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. :40.00 Max. :10.000 Max. :5.000
## Final.Exam.Marks..out.of.100.
## Min. : 25.00
## 1st Qu.: 58.00
## Median : 65.00
## Mean : 64.86
## 3rd Qu.: 73.00
## Max. :100.00
# Check for missing values
colSums(is.na(students))
## Student_ID Attendance....
## 0 0
## Internal.Test.1..out.of.40. Internal.Test.2..out.of.40.
## 0 0
## Assignment.Score..out.of.10. Daily.Study.Hours
## 0 0
## Final.Exam.Marks..out.of.100.
## 0
# Calculate key statistics for numeric variables
students %>%
select(-Student_ID) %>%
summary()
## Attendance.... Internal.Test.1..out.of.40. Internal.Test.2..out.of.40.
## Min. : 52.00 Min. :18.00 Min. :16.00
## 1st Qu.: 80.00 1st Qu.:29.00 1st Qu.:29.00
## Median : 85.00 Median :32.00 Median :33.00
## Mean : 84.89 Mean :32.12 Mean :32.46
## 3rd Qu.: 90.00 3rd Qu.:35.00 3rd Qu.:36.00
## Max. :100.00 Max. :40.00 Max. :40.00
## Assignment.Score..out.of.10. Daily.Study.Hours Final.Exam.Marks..out.of.100.
## Min. : 4.000 Min. :1.000 Min. : 25.00
## 1st Qu.: 7.000 1st Qu.:2.000 1st Qu.: 58.00
## Median : 8.000 Median :3.000 Median : 65.00
## Mean : 7.507 Mean :2.824 Mean : 64.86
## 3rd Qu.: 8.000 3rd Qu.:3.000 3rd Qu.: 73.00
## Max. :10.000 Max. :5.000 Max. :100.00
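The column names produced by `read.csv()` (for example `Attendance....`) are awkward to type and easy to mistype in formulas. One option is to rename them once after loading. The sketch below does this on a two-row stand-in data frame built from the first two rows of the `str()` output, not on the real file. The short names (`attendance`, `test1`, and so on) are hypothetical and are not used elsewhere in this report.

```r
# Original column names, exactly as read.csv() produced them.
old_names <- c("Student_ID", "Attendance....",
               "Internal.Test.1..out.of.40.", "Internal.Test.2..out.of.40.",
               "Assignment.Score..out.of.10.", "Daily.Study.Hours",
               "Final.Exam.Marks..out.of.100.")

# Stand-in data: the first two students from the str() output above.
demo <- data.frame(c("S1000", "S1001"), c(84L, 91L), c(30L, 24L),
                   c(36L, 38L), c(7L, 6L), c(3L, 3L), c(72L, 56L))
names(demo) <- old_names

# Hypothetical short names, matched to the originals by position.
lookup <- setNames(
  c("student_id", "attendance", "test1", "test2",
    "assignment", "study_hours", "final_marks"),
  old_names
)

# Rename by lookup so the result does not depend on column order.
names(demo) <- unname(lookup[names(demo)])
names(demo)
```

With names like these, a formula such as `final_marks ~ attendance` needs no backticks.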
First, let’s examine the distribution of our target variable: Final Exam Marks.
ggplot(data = students, aes(x = `Final.Exam.Marks..out.of.100.`)) +
geom_histogram(binwidth = 5, fill = "skyblue", color = "navy", alpha = 0.7) +
labs(title = "Distribution of Final Exam Marks",
x = "Final Exam Marks (out of 100)",
y = "Frequency") +
theme_minimal() +
geom_vline(aes(xintercept = mean(`Final.Exam.Marks..out.of.100.`)),
color = "red", linetype = "dashed", linewidth = 1) +
annotate("text", x = mean(students$`Final.Exam.Marks..out.of.100.`) + 8,
y = 150, label = paste("Mean =",
round(mean(students$`Final.Exam.Marks..out.of.100.`), 2)),
color = "red")
The distribution appears approximately normal, with at most a slight left skew: the mean final exam score (64.86) sits just below the median (65). Half of the students score between 58 and 73 marks.
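The shape of the distribution can also be checked numerically rather than by eye. Base R has no built-in skewness function, so the sketch below defines a hypothetical helper, `sample_skewness()`, and applies it to two made-up vectors (not the real marks) to show how the sign behaves.

```r
# Moment-based sample skewness: the third central moment divided by
# the second central moment raised to the power 1.5.
# Hypothetical helper, not part of base R.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

symmetric   <- c(40, 50, 60, 70, 80)  # made-up, symmetric around 60
left_skewed <- c(20, 60, 65, 70, 75)  # made-up, long tail on the low side

sample_skewness(symmetric)    # zero for a symmetric sample
sample_skewness(left_skewed)  # negative for a left-skewed sample
```

Applied to the real data, the call would be `sample_skewness(students$Final.Exam.Marks..out.of.100.)`. A value close to zero would support the reading of an approximately normal distribution.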
ggplot(data = students, aes(x = `Final.Exam.Marks..out.of.100.`)) +
geom_density(fill = "lightblue", alpha = 0.5) +
labs(title = "Density Plot of Final Exam Marks",
x = "Final Exam Marks",
y = "Density") +
theme_minimal()
Let’s examine the relationships between all numeric variables.
# Create correlation matrix
cor_data <- students %>%
select(-Student_ID) %>%
cor()
# Visualize correlation matrix
corrplot(cor_data, method = "circle", type = "upper",
tl.col = "black", tl.srt = 45,
title = "Correlation Matrix of Student Performance Variables",
mar = c(0,0,2,0))
# Display correlation with Final Exam Marks
cor_with_final <- cor_data[, "Final.Exam.Marks..out.of.100."]
sort(cor_with_final, decreasing = TRUE)
## Final.Exam.Marks..out.of.100. Attendance....
## 1.0000000 0.7256438
## Internal.Test.2..out.of.40. Internal.Test.1..out.of.40.
## 0.6910491 0.6892272
## Assignment.Score..out.of.10. Daily.Study.Hours
## 0.6694003 0.4128769
ggplot(data = students, aes(x = `Attendance....`,
y = `Final.Exam.Marks..out.of.100.`)) +
geom_point(alpha = 0.4, color = "steelblue") +
geom_smooth(method = "lm", color = "red", se = TRUE) +
labs(title = "Attendance vs Final Exam Marks",
x = "Attendance (%)",
y = "Final Exam Marks") +
theme_minimal()
p1 <- ggplot(data = students, aes(x = `Internal.Test.1..out.of.40.`,
y = `Final.Exam.Marks..out.of.100.`)) +
geom_point(alpha = 0.4, color = "darkgreen") +
geom_smooth(method = "lm", color = "red") +
labs(title = "Internal Test 1 vs Final Marks",
x = "Internal Test 1 (out of 40)",
y = "Final Exam Marks") +
theme_minimal()
p2 <- ggplot(data = students, aes(x = `Internal.Test.2..out.of.40.`,
y = `Final.Exam.Marks..out.of.100.`)) +
geom_point(alpha = 0.4, color = "purple") +
geom_smooth(method = "lm", color = "red") +
labs(title = "Internal Test 2 vs Final Marks",
x = "Internal Test 2 (out of 40)",
y = "Final Exam Marks") +
theme_minimal()
grid.arrange(p1, p2, ncol = 2)
# Convert study hours to factor for better visualization
students$Study_Hours_Factor <- as.factor(students$Daily.Study.Hours)
ggplot(data = students, aes(x = Study_Hours_Factor,
y = `Final.Exam.Marks..out.of.100.`,
fill = Study_Hours_Factor)) +
geom_boxplot(alpha = 0.7) +
labs(title = "Final Exam Marks by Daily Study Hours",
x = "Daily Study Hours",
y = "Final Exam Marks") +
theme_minimal() +
theme(legend.position = "none")
ggplot(data = students, aes(x = `Assignment.Score..out.of.10.`,
y = `Final.Exam.Marks..out.of.100.`)) +
geom_point(alpha = 0.4, color = "orange") +
geom_smooth(method = "lm", color = "red", se = TRUE) +
labs(title = "Assignment Score vs Final Exam Marks",
x = "Assignment Score (out of 10)",
y = "Final Exam Marks") +
theme_minimal()
Let’s test if there are significant differences in final exam performance based on daily study hours.
# Summary statistics by study hours
study_summary <- students %>%
group_by(Daily.Study.Hours) %>%
summarise(
n = n(),
mean_final = mean(`Final.Exam.Marks..out.of.100.`),
sd_final = sd(`Final.Exam.Marks..out.of.100.`),
variance = var(`Final.Exam.Marks..out.of.100.`)
)
print(study_summary)
## # A tibble: 5 × 5
## Daily.Study.Hours n mean_final sd_final variance
## <int> <int> <dbl> <dbl> <dbl>
## 1 1 14 43.8 10.7 114.
## 2 2 533 58.9 10.4 109.
## 3 3 1248 66.1 10.3 106.
## 4 4 202 74.0 10.0 101.
## 5 5 3 82 3.61 13
# Perform ANOVA
anova_model <- aov(`Final.Exam.Marks..out.of.100.` ~
as.factor(Daily.Study.Hours), data = students)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Daily.Study.Hours) 4 44630 11157 104.8 <2e-16 ***
## Residuals 1995 212490 107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Post-hoc test (Tukey HSD)
TukeyHSD(anova_model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Final.Exam.Marks..out.of.100. ~ as.factor(Daily.Study.Hours), data = students)
##
## $`as.factor(Daily.Study.Hours)`
## diff lwr upr p adj
## 2-1 15.141115 7.512074 22.770156 0.0000007
## 3-1 22.317651 14.744750 29.890552 0.0000000
## 4-1 30.204385 22.417010 37.991759 0.0000000
## 5-1 38.214286 20.287447 56.141124 0.0000001
## 3-2 7.176536 5.718511 8.634561 0.0000000
## 4-2 15.063270 12.735134 17.391405 0.0000000
## 5-2 23.073171 6.759111 39.387231 0.0010995
## 4-3 7.886734 5.749732 10.023736 0.0000000
## 5-3 15.896635 -0.391248 32.184517 0.0596825
## 5-4 8.009901 -8.378799 24.398601 0.6696940
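As a sanity check, the F statistic in the one-way ANOVA table can be reproduced from its sums of squares, and the same numbers give an effect size (eta squared) that the table does not print. A minimal sketch, with every value copied from the output above:

```r
# Values copied from the one-way ANOVA table above.
ss_between <- 44630    # Sum Sq for study hours
df_between <- 4
ss_within  <- 212490   # Sum Sq for residuals
df_within  <- 1995

# F is the ratio of the two mean squares.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within
f_stat     <- ms_between / ms_within

# Eta squared: share of total variation attributed to the grouping.
eta_sq <- ss_between / (ss_between + ss_within)

round(f_stat, 1)  # 104.8, matching the table
round(eta_sq, 3)  # 0.174
```

So although the group differences are highly significant, study hours on their own account for roughly 17% of the variation in final marks. The 1-hour and 5-hour groups contain only 14 and 3 students, which is why the Tukey intervals involving them are so wide.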
Let’s create attendance groups and examine interactions.
# Create attendance groups
students$Attendance_Group <- cut(students$`Attendance....`,
breaks = c(0, 75, 85, 100),
labels = c("Low", "Medium", "High"))
# Two-way ANOVA
anova_model2 <- aov(`Final.Exam.Marks..out.of.100.` ~
as.factor(Daily.Study.Hours) * Attendance_Group,
data = students)
summary(anova_model2)
## Df Sum Sq Mean Sq F value
## as.factor(Daily.Study.Hours) 4 44630 11157 164.082
## Attendance_Group 2 77030 38515 566.399
## as.factor(Daily.Study.Hours):Attendance_Group 5 278 56 0.817
## Residuals 1988 135183 68
## Pr(>F)
## as.factor(Daily.Study.Hours) <2e-16 ***
## Attendance_Group <2e-16 ***
## as.factor(Daily.Study.Hours):Attendance_Group 0.538
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
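The attendance groups above depend on how `cut()` treats the break points. By default intervals are closed on the right, so a student with exactly 75% attendance falls in "Low" and one with exactly 85% falls in "Medium". A small sketch on made-up attendance values shows the boundaries:

```r
# Made-up attendance values chosen to sit on and around the break points.
attendance <- c(52, 75, 76, 85, 86, 100)

# Same breaks and labels as in the analysis above.
groups <- cut(attendance,
              breaks = c(0, 75, 85, 100),
              labels = c("Low", "Medium", "High"))

data.frame(attendance, groups)
# 52 and 75 -> Low; 76 and 85 -> Medium; 86 and 100 -> High
```

Because the median attendance is 85, these breaks place students at the median in "Medium" rather than "High". Passing `right = FALSE` would shift each boundary value into the higher group, but with these breaks it would also leave a value of exactly 100 unclassified (`NA`).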
Let’s start with a simple model using only attendance as a predictor.
simple_model <- lm(`Final.Exam.Marks..out.of.100.` ~ `Attendance....`,
data = students)
summary(simple_model)
##
## Call:
## lm(formula = Final.Exam.Marks..out.of.100. ~ Attendance....,
## data = students)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.9094 -5.4021 0.1196 5.4230 23.0906
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.1883 1.9181 -13.13 <2e-16 ***
## Attendance.... 1.0607 0.0225 47.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.806 on 1998 degrees of freedom
## Multiple R-squared: 0.5266, Adjusted R-squared: 0.5263
## F-statistic: 2222 on 1 and 1998 DF, p-value: < 2.2e-16
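With a single predictor, the regression output is tied directly to the Pearson correlation reported earlier: R-squared is the squared correlation, and the slope's t value follows from the correlation and the sample size. A short check using the printed correlation between attendance and final marks:

```r
# Correlation between attendance and final marks, from the output above.
r <- 0.7256438
n <- 2000

# In simple linear regression, R-squared equals r squared.
r_squared <- r^2

# t statistic for the slope, on n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

round(r_squared, 4)  # 0.5266, the Multiple R-squared above
round(t_stat, 2)     # 47.14, the t value for Attendance above
```

In other words, attendance alone accounts for a little over half of the variance in final marks.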
Now let’s build a comprehensive model with all predictors.
# Full model
full_model <- lm(`Final.Exam.Marks..out.of.100.` ~
`Attendance....` +
`Internal.Test.1..out.of.40.` +
`Internal.Test.2..out.of.40.` +
`Assignment.Score..out.of.10.` +
Daily.Study.Hours,
data = students)
summary(full_model)
##
## Call:
## lm(formula = Final.Exam.Marks..out.of.100. ~ Attendance.... +
## Internal.Test.1..out.of.40. + Internal.Test.2..out.of.40. +
## Assignment.Score..out.of.10. + Daily.Study.Hours, data = students)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9573 -3.0300 0.1067 3.0869 14.3513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -44.78668 1.17717 -38.05 <2e-16 ***
## Attendance.... 0.38984 0.01724 22.61 <2e-16 ***
## Internal.Test.1..out.of.40. 0.87177 0.02989 29.16 <2e-16 ***
## Internal.Test.2..out.of.40. 0.90084 0.02951 30.53 <2e-16 ***
## Assignment.Score..out.of.10. 1.48844 0.14042 10.60 <2e-16 ***
## Daily.Study.Hours 2.87991 0.17785 16.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.584 on 1994 degrees of freedom
## Multiple R-squared: 0.8371, Adjusted R-squared: 0.8367
## F-statistic: 2049 on 5 and 1994 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(full_model)
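To make the fitted coefficients concrete, here is a worked prediction for the first student in the data (S1000), using the rounded estimates printed above and the first values shown in the `str()` output. The vector names are illustrative only.

```r
# Coefficients copied from summary(full_model) above.
coefs <- c(intercept   = -44.78668,
           attendance  = 0.38984,
           test1       = 0.87177,
           test2       = 0.90084,
           assignment  = 1.48844,
           study_hours = 2.87991)

# Student S1000: first row of the str() output (1 multiplies the intercept).
student <- c(intercept = 1, attendance = 84, test1 = 30,
             test2 = 36, assignment = 7, study_hours = 3)

predicted <- sum(coefs * student)
actual    <- 72

round(predicted, 1)           # 65.6
round(actual - predicted, 1)  # 6.4 (residual)
```

The residual of about 6.4 marks is larger than the residual standard error of 4.584 but well inside the range of residuals reported for the model.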
Let’s explore potential interaction effects between internal test scores.
interaction_model <- lm(`Final.Exam.Marks..out.of.100.` ~
`Attendance....` +
`Internal.Test.1..out.of.40.` *
`Internal.Test.2..out.of.40.` +
`Assignment.Score..out.of.10.` +
Daily.Study.Hours,
data = students)
summary(interaction_model)
##
## Call:
## lm(formula = Final.Exam.Marks..out.of.100. ~ Attendance.... +
## Internal.Test.1..out.of.40. * Internal.Test.2..out.of.40. +
## Assignment.Score..out.of.10. + Daily.Study.Hours, data = students)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.0023 -3.0010 0.0718 3.0665 14.1321
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -40.830638 5.058924
## Attendance.... 0.389926 0.017240
## Internal.Test.1..out.of.40. 0.746220 0.158986
## Internal.Test.2..out.of.40. 0.777487 0.156221
## Assignment.Score..out.of.10. 1.490158 0.140450
## Daily.Study.Hours 2.881347 0.177876
## Internal.Test.1..out.of.40.:Internal.Test.2..out.of.40. 0.003870 0.004813
## t value Pr(>|t|)
## (Intercept) -8.071 1.19e-15 ***
## Attendance.... 22.618 < 2e-16 ***
## Internal.Test.1..out.of.40. 4.694 2.87e-06 ***
## Internal.Test.2..out.of.40. 4.977 7.02e-07 ***
## Assignment.Score..out.of.10. 10.610 < 2e-16 ***
## Daily.Study.Hours 16.199 < 2e-16 ***
## Internal.Test.1..out.of.40.:Internal.Test.2..out.of.40. 0.804 0.421
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.584 on 1993 degrees of freedom
## Multiple R-squared: 0.8371, Adjusted R-squared: 0.8366
## F-statistic: 1707 on 6 and 1993 DF, p-value: < 2.2e-16
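In a model with an interaction, the coefficient on Internal Test 1 is its slope when Internal Test 2 equals zero, which is outside the observed range (16 to 40). The slope at realistic Test 2 scores is the main effect plus the interaction coefficient times the Test 2 score. A short sketch using the estimates above and the Test 2 mean from the summary statistics:

```r
# Estimates copied from summary(interaction_model) above.
b_test1       <- 0.746220
b_interaction <- 0.003870

# Mean of Internal Test 2, from the summary statistics.
mean_test2 <- 32.46

# Slope of Test 1 for a student with an average Test 2 score.
effect_at_mean <- b_test1 + b_interaction * mean_test2
round(effect_at_mean, 3)  # 0.872

# Slope of Test 1 at the lowest and highest observed Test 2 scores.
effect_range <- b_test1 + b_interaction * c(16, 40)
round(effect_range, 3)    # 0.808 0.901
```

At the mean of Test 2 the slope is essentially the 0.872 estimated by the model without the interaction, and it barely changes across the observed range. This is consistent with the non-significant interaction term (p = 0.421).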
# Compare models using AIC
AIC(simple_model, full_model, interaction_model)
## df AIC
## simple_model 3 13899.09
## full_model 7 11773.62
## interaction_model 8 11774.97
# Compare using adjusted R-squared
cat("Simple Model Adj R-squared:", summary(simple_model)$adj.r.squared, "\n")
## Simple Model Adj R-squared: 0.526322
cat("Full Model Adj R-squared:", summary(full_model)$adj.r.squared, "\n")
## Full Model Adj R-squared: 0.8366662
cat("Interaction Model Adj R-squared:", summary(interaction_model)$adj.r.squared, "\n")
## Interaction Model Adj R-squared: 0.8366372
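AIC values are easiest to read as differences from the best (lowest) model. The sketch below works only from the three AIC values printed above:

```r
# AIC values copied from the output above.
aic <- c(simple_model      = 13899.09,
         full_model        = 11773.62,
         interaction_model = 11774.97)

# Difference from the lowest AIC; the best model has delta 0.
delta <- aic - min(aic)
round(delta, 2)

# Name of the model with the lowest AIC.
best <- names(which.min(aic))
best  # "full_model"
```

The interaction model is only about 1.35 AIC units worse than the full model, so the two fit almost equally well, and the extra parameter is not justified. The simple model is more than 2000 units worse. This agrees with the adjusted R-squared comparison, so the full model without the interaction is the one used for prediction.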
# Add predictions to dataset
students$predicted_marks <- predict(full_model, students)
# Plot actual vs predicted
ggplot(students, aes(x = `Final.Exam.Marks..out.of.100.`,
y = predicted_marks)) +
geom_point(alpha = 0.4, color = "steelblue") +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
labs(title = "Actual vs Predicted Final Exam Marks",
x = "Actual Marks",
y = "Predicted Marks") +
theme_minimal()
# Calculate RMSE
rmse <- sqrt(mean((students$`Final.Exam.Marks..out.of.100.` -
students$predicted_marks)^2))
cat("Root Mean Square Error:", round(rmse, 2), "\n")
## Root Mean Square Error: 4.58
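This RMSE is computed on the same data used to fit the model, so it is closely related to the residual standard error reported by `summary(full_model)`. The two differ only in the divisor: the RMSE divides the residual sum of squares by n, while the residual standard error divides by the residual degrees of freedom. A quick check from the printed values:

```r
# Values copied from summary(full_model) above.
rse      <- 4.584   # residual standard error
n        <- 2000    # number of students
df_resid <- 1994    # residual degrees of freedom (n minus 6 coefficients)

# Rescale from the df-based divisor to the n-based divisor.
rmse_from_rse <- rse * sqrt(df_resid / n)
round(rmse_from_rse, 2)  # 4.58, matching the RMSE above
```

Because it is an in-sample figure, 4.58 is likely to be slightly optimistic. An RMSE computed on students held out from model fitting would give a fairer estimate of prediction error.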
Based on our analysis:
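Strong Predictors: Attendance shows the strongest correlation with final exam marks (r = 0.73), followed closely by Internal Test 2 (r = 0.69), Internal Test 1 (r = 0.69), and Assignment Score (r = 0.67). Daily Study Hours has the weakest correlation (r = 0.41).
Study Hours Effect: Daily study hours show a significant positive relationship with final exam marks (one-way ANOVA, F = 104.8, p < 2e-16). Mean marks rise from 43.8 at 1 hour to 74.0 at 4 hours. The 5-hour group averages 82 but contains only 3 students, so its differences from the 3-hour and 4-hour groups are not statistically significant.
Model Performance: Our multiple regression model explains approximately 84% of the variance in final exam scores (adjusted R-squared = 0.837), with an in-sample RMSE of 4.58 marks. Adding an interaction between the two internal tests does not improve the fit (p = 0.421).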
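Attendance Matters: In the simple regression, each additional percentage point of attendance is associated with about 1.06 more final exam marks. The association shrinks to about 0.39 marks per point once the other predictors are included, but remains highly significant.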
Consistent Performance: Students who perform well in internal assessments tend to maintain that performance in final exams, suggesting the importance of continuous evaluation.
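Recommendations:
Early Intervention: Use Internal Test 1 scores to identify at-risk students early in the semester.
Attendance Monitoring: Monitor attendance closely, as attendance is strongly associated with final performance. The analysis shows association, not causation, so attendance policies alone may not raise marks.
Study Habits: Encourage students to maintain at least 3 hours of daily study; mean final marks were higher in each successive study-hours group.
Assignment Completion: Although assignment scores show a somewhat weaker correlation (r = 0.67) than attendance and the internal tests, they contribute to overall understanding and should be emphasized.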
Limitations:
The model assumes linear relationships between variables.
External factors (prior knowledge, aptitude, socioeconomic factors) are not captured.
The data represents a single cohort and may not generalize to all student populations.
Future work:
Include demographic variables (age, gender, background).
Analyze subject-wise performance patterns.
Develop predictive models for early warning systems.
Study the impact of teaching methodologies on performance.