data¶
The data module provides access to the Big Ten dataset from the College Scorecard.
Module Overview¶
This module provides:
- Big Ten College Scorecard data (1996-2023)
- Flexible data loading with filtering
- Dataset summary and metadata functions
- 38 variables covering enrollment, admissions, completion, tuition, and demographics
Functions¶
load_bigten_data(institutions=None, years=None, columns=None)¶
Load the Big Ten historical institutional dataset (1996-2023).
This dataset contains historical data from the College Scorecard for all 18 Big Ten Conference institutions from 1996 to 2023.
Args:
- institutions: Filter for specific institutions (default: all 18)
- years: Filter for specific years (default: all years 1996-2023)
- columns: Select specific columns (default: all columns)
Returns: pandas DataFrame with BigTen institutional data
Raises:
- FileNotFoundError: If BigTen.csv is not found
- ValueError: If invalid institutions or years specified
Dataset Information:
- 504 rows (18 institutions × 28 years)
- 38 columns including:
  * Institutional characteristics (public/private, land-grant, AAU)
  * Enrollment data (total, by gender, by race/ethnicity)
  * Admission rates
  * Completion rates
  * Tuition and fees
  * Cost of attendance
  * Pell grant recipients
Examples:
>>> # Load all data
>>> df = load_bigten_data()
>>> print(df.shape)
(504, 38)
>>> # Filter for specific institutions
>>> df_msu_um = load_bigten_data(institutions=['MSU', 'Michigan'])
>>> print(df_msu_um['name'].unique())
['MSU', 'Michigan']
>>> # Filter for specific years
>>> df_recent = load_bigten_data(years=[2020, 2021, 2022, 2023])
>>> print(len(df_recent))
72
>>> # Select specific columns
>>> df_small = load_bigten_data(
...     institutions=['MSU'],
...     columns=['name', 'entry_term', 'UGDS', 'ADM_RATE']
... )
>>> # MSU enrollment over time
>>> msu_data = load_bigten_data(institutions=['MSU'])
>>> print(msu_data[['entry_term', 'UGDS']].head())
Source code in msuthemes/data/data_loader.py
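The ValueError listed under Raises can be anticipated by checking the years before calling the loader. The following is a minimal sketch, not the library's own validation: check_years is a hypothetical helper, and the 1996-2023 bounds are taken from the dataset description above.

```python
# Hypothetical pre-check; msuthemes may validate its arguments differently.
FIRST_YEAR, LAST_YEAR = 1996, 2023  # range stated in the dataset description


def check_years(years):
    """Return the years as a sorted, de-duplicated list.

    Raises ValueError if any year falls outside FIRST_YEAR..LAST_YEAR.
    """
    years = sorted(set(int(y) for y in years))
    bad = [y for y in years if not FIRST_YEAR <= y <= LAST_YEAR]
    if bad:
        raise ValueError(f"Years outside {FIRST_YEAR}-{LAST_YEAR}: {bad}")
    return years


print(check_years([2023, 2020, 2021]))  # [2020, 2021, 2023]
```

The returned list can then be passed as the years argument of load_bigten_data.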
get_bigten_summary()¶
Get summary statistics for the BigTen dataset.
Returns: DataFrame with summary statistics by institution
Examples:
>>> summary = get_bigten_summary()
>>> print(summary.head())
Source code in msuthemes/data/data_loader.py
get_dataset_info()¶
Get information about the BigTen dataset.
Returns: Dictionary with dataset information
Examples:
>>> info = get_dataset_info()
>>> print(info['n_institutions'])
18
>>> print(info['years'])
(1996, 2023)
Source code in msuthemes/data/data_loader.py
list_available_datasets()¶
List all available datasets.
Returns: List of dataset filenames
Examples:
>>> datasets = list_available_datasets()
>>> print(datasets)
['BigTen.csv']
Source code in msuthemes/data/data_loader.py
Usage Examples¶
Basic Data Loading¶
from msuthemes import load_bigten_data
# Load complete dataset
df = load_bigten_data()
print(df.shape) # (504, 38)
print(df.columns)
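The shape printed above follows from the panel layout: one row per institution per year. As a quick sanity check, the expected row count for a full or filtered load can be computed by hand. This sketch assumes the panel is complete (no institution-year rows missing), which is what the documented 504 rows imply; expected_rows is a hypothetical helper, not part of msuthemes.

```python
# Counts taken from the dataset description: 18 institutions, 1996-2023.
N_INSTITUTIONS = 18
FIRST_YEAR, LAST_YEAR = 1996, 2023
N_YEARS = LAST_YEAR - FIRST_YEAR + 1  # both endpoints are included


def expected_rows(n_institutions=N_INSTITUTIONS, n_years=N_YEARS):
    """Rows in a complete institution-by-year panel."""
    return n_institutions * n_years


print(expected_rows())           # 504 (18 institutions x 28 years)
print(expected_rows(n_years=4))  # 72 (all institutions, 2020-2023)
print(expected_rows(2, 28))      # 56 (two institutions, all years)
```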
Filtering by Institution¶
from msuthemes import load_bigten_data
# Load data for single institution
msu_data = load_bigten_data(institutions=['Michigan State'])
# Load data for multiple institutions
rivalry = load_bigten_data(
    institutions=['Michigan State', 'Michigan', 'Ohio State']
)
# Using abbreviations
msu_data = load_bigten_data(institutions=['MSU'])
Filtering by Year¶
from msuthemes import load_bigten_data
# Load specific years
recent = load_bigten_data(years=[2020, 2021, 2022, 2023])
# Load year range
decade = load_bigten_data(years=list(range(2010, 2020)))
# Load single year
df_2023 = load_bigten_data(years=[2023])
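Python's range() excludes its stop value, so range(2010, 2020) above covers 2010 through 2019. When both endpoints should be included, a small wrapper avoids off-by-one mistakes. The helper below, year_span, is a hypothetical convenience and not part of msuthemes.

```python
def year_span(start, end):
    """Return every year from start through end, inclusive."""
    if start > end:
        raise ValueError("start must not be later than end")
    return list(range(start, end + 1))


decade = year_span(2010, 2019)
print(len(decade), decade[0], decade[-1])  # 10 2010 2019
```

The result is a plain list, so it can be passed directly as years=year_span(2015, 2023).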
Filtering by Columns¶
from msuthemes import load_bigten_data
# Load specific columns
df = load_bigten_data(
    columns=['INSTNM', 'YEAR', 'ADM_RATE', 'UGDS']
)
# Combine filters
msu_recent = load_bigten_data(
    institutions=['Michigan State'],
    years=list(range(2015, 2024)),
    columns=['YEAR', 'ADM_RATE', 'C150_4']
)
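To make the effect of combined filters concrete, the sketch below applies the same three kinds of filter to a tiny synthetic DataFrame using plain pandas. It is an illustration only, not the implementation in data_loader.py: filter_frame is a hypothetical function, the values are made up, and the INSTNM and YEAR column names are assumed from the examples in this section.

```python
import pandas as pd

# Tiny synthetic stand-in for the real dataset; all values are made up.
toy = pd.DataFrame({
    "INSTNM": ["Michigan State", "Michigan State", "Michigan", "Michigan"],
    "YEAR": [2022, 2023, 2022, 2023],
    "ADM_RATE": [0.5, 0.6, 0.3, 0.4],
    "UGDS": [100, 110, 200, 210],
})


def filter_frame(df, institutions=None, years=None, columns=None):
    """Apply optional row filters, then an optional column selection."""
    mask = pd.Series(True, index=df.index)  # start by keeping every row
    if institutions is not None:
        mask &= df["INSTNM"].isin(institutions)
    if years is not None:
        mask &= df["YEAR"].isin(years)
    out = df.loc[mask]
    if columns is not None:
        out = out[list(columns)]
    return out.reset_index(drop=True)


subset = filter_frame(
    toy,
    institutions=["Michigan State"],
    years=[2023],
    columns=["YEAR", "ADM_RATE"],
)
print(subset.shape)  # (1, 2)
```

Row filters are applied before column selection here so that INSTNM and YEAR remain available for filtering even when they are not among the requested columns.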
Getting Dataset Summary¶
from msuthemes import get_bigten_summary
# Get summary statistics
summary = get_bigten_summary()
print(summary)
Getting Dataset Information¶
from msuthemes import get_dataset_info
# Get metadata
info = get_dataset_info()
print(info['description'])
print(info['source'])
print(info['years'])
print(info['institutions'])
Listing Available Datasets¶
from msuthemes import list_available_datasets
# List all datasets
datasets = list_available_datasets()
print(datasets)
Complete Analysis Example¶
from msuthemes import (
    load_bigten_data,
    bigten_palette,
    theme_msu
)
import matplotlib.pyplot as plt
# Apply theme
theme_msu(grid=True)
# Load and filter data
schools = ['Michigan State', 'Michigan', 'Wisconsin']
df = load_bigten_data(
    institutions=schools,
    years=list(range(2010, 2024)),
    columns=['INSTNM', 'YEAR', 'ADM_RATE']
)
# Get colors
colors = bigten_palette(schools)
# Plot
fig, ax = plt.subplots(figsize=(10, 6))
for school, color in zip(schools, colors):
    school_data = df[df['INSTNM'] == school]
    ax.plot(school_data['YEAR'],
            school_data['ADM_RATE'] * 100,
            color=color,
            linewidth=2.5,
            marker='o',
            label=school)
ax.set_xlabel('Year')
ax.set_ylabel('Admission Rate (%)')
ax.set_title('Admission Rate Trends')
ax.legend()
plt.tight_layout()
plt.show()
Dataset Variables¶
Institution Identifiers¶
- UNITID: Unique institution ID
- INSTNM: Institution name
- YEAR: Academic year
Enrollment¶
- UGDS: Total undergraduate enrollment
- UG: Number of undergraduates
- GRAD: Number of graduates
Admissions¶
- ADM_RATE: Admission rate (0-1)
- ADM_RATE_ALL: Overall admission rate
- SATVRMID, SATMTMID, SATWRMID: SAT score midpoints
- ACTCMMID, ACTENMID, ACTMTMID, ACTWRMID: ACT score midpoints
Completion¶
- C150_4: 6-year completion rate (0-1)
- C200_4: 8-year completion rate (0-1)
Costs¶
- TUITIONFEE_IN: In-state tuition and fees
- TUITIONFEE_OUT: Out-of-state tuition and fees
- TUITIONFEE_PROG: Program tuition
Demographics¶
- UGDS_WHITE: % White undergraduates
- UGDS_BLACK: % Black undergraduates
- UGDS_HISP: % Hispanic undergraduates
- UGDS_ASIAN: % Asian undergraduates
- UGDS_AIAN: % American Indian/Alaska Native
- UGDS_NHPI: % Native Hawaiian/Pacific Islander
- UGDS_2MOR: % Two or more races
- UGDS_NRA: % Non-resident aliens
- UGDS_UNKN: % Race/ethnicity unknown
Financial Aid¶
- PCTPELL: % receiving Pell Grants
- PCTFLOAN: % receiving federal loans
Institution Characteristics¶
- CONTROL: Public/private control
- HIGHDEG: Highest degree awarded
- REGION: Geographic region
- LOCALE: Urban/rural classification
Data Quality Notes¶
- Missing values exist for some variables/years
- Institution names match list_bigten_institutions()
- Data sourced from U.S. Department of Education College Scorecard
- Updated through 2023
Working with Missing Data¶
from msuthemes import load_bigten_data
import pandas as pd
# Load data
df = load_bigten_data()
# Check missing values
print(df.isnull().sum())
# Drop rows with missing values in specific column
df_clean = df.dropna(subset=['ADM_RATE'])
# Fill missing values with mean
df_filled = df.fillna(df.mean(numeric_only=True))
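Filling with the overall column mean blends values across all institutions. One alternative is to fill each gap with the mean of the same institution's observed values. The sketch below shows this on a small synthetic DataFrame; the values are made up, and the INSTNM column name is assumed from the variable list above.

```python
import pandas as pd

# Synthetic stand-in with one missing value per institution; values are made up.
toy = pd.DataFrame({
    "INSTNM": ["A", "A", "A", "B", "B", "B"],
    "YEAR": [2021, 2022, 2023, 2021, 2022, 2023],
    "ADM_RATE": [0.80, None, 0.90, 0.20, 0.30, None],
})

# Mean of each institution's observed values, aligned to the original rows.
group_mean = toy.groupby("INSTNM")["ADM_RATE"].transform("mean")

# Fill gaps with the institution's own mean; observed values are unchanged.
toy["ADM_RATE_FILLED"] = toy["ADM_RATE"].fillna(group_mean)

print(toy["ADM_RATE_FILLED"].round(2).tolist())
# [0.8, 0.85, 0.9, 0.2, 0.3, 0.25]
```

If an institution has no observed values at all for a variable, its group mean is also missing and the gap stays unfilled, so a final isnull() check is still worthwhile.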
See Also¶
- Dataset Guide - Detailed dataset usage guide
- Big Ten - Big Ten utilities
- Gallery - Data visualization examples