Datasets¶
MSUthemes includes a curated dataset of Big Ten Conference institutions from the U.S. Department of Education's College Scorecard. This dataset enables comparative analyses and visualizations across the conference.
Overview¶
The Big Ten dataset includes:
- Time Period: 1996-2023 (28 years)
- Institutions: All 18 Big Ten schools
- Variables: 38 metrics including enrollment, admissions, completion rates, tuition, and demographics
- Total Observations: 504 rows (18 institutions × 28 years)
Loading Data¶
Basic Loading¶
from msuthemes import load_bigten_data
# Load complete dataset
df = load_bigten_data()
print(df.shape) # (504, 38)
print(df.columns)
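The row count quoted above implies one row per institution per year. A minimal sketch of how that could be checked is shown here on a small synthetic frame; `panel_report` is a hypothetical helper, not part of msuthemes, and the placeholder school names are made up.

```python
import pandas as pd

# Synthetic stand-in: 3 placeholder institutions x 4 years.
# With the real data you would pass the frame returned by load_bigten_data().
demo = pd.DataFrame(
    [(name, year)
     for name in ["School A", "School B", "School C"]
     for year in range(2020, 2024)],
    columns=["INSTNM", "YEAR"],
)


def panel_report(df):
    """Summarise whether df holds exactly one row per institution-year."""
    n_inst = df["INSTNM"].nunique()
    n_years = df["YEAR"].nunique()
    return {
        "institutions": n_inst,
        "years": n_years,
        "rows": len(df),
        "duplicates": int(df.duplicated(subset=["INSTNM", "YEAR"]).sum()),
        "balanced": len(df) == n_inst * n_years,
    }


report = panel_report(demo)
print(report)
```

On the full dataset, the same report would be expected to show 18 institutions, 28 years, and 504 rows if the figures in the overview hold.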
Filtering by Institution¶
from msuthemes import load_bigten_data
# Load data for a single institution
msu_data = load_bigten_data(institutions=['Michigan State'])
# Load data for multiple institutions
rivalry_data = load_bigten_data(
    institutions=['Michigan State', 'Michigan', 'Ohio State']
)
# Using abbreviations works too
msu_data = load_bigten_data(institutions=['MSU'])
Filtering by Year¶
from msuthemes import load_bigten_data
# Load recent years only
recent_data = load_bigten_data(years=[2020, 2021, 2022, 2023])
# Load a specific year range
decade_data = load_bigten_data(years=list(range(2010, 2020)))
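Note that Python's `range` excludes its stop value, so `range(2010, 2020)` covers 2010 through 2019. The sketch below makes that explicit; `year_span` is a hypothetical convenience helper for inclusive spans, not an msuthemes function.

```python
def year_span(first, last):
    """Return every year from first to last, both ends included (hypothetical helper)."""
    if last < first:
        raise ValueError("last must not be earlier than first")
    return list(range(first, last + 1))


decade = list(range(2010, 2020))       # 2010..2019: the stop value is excluded
through_2020 = year_span(2010, 2020)   # 2010..2020: both ends included

print(decade[0], decade[-1], len(decade))
print(through_2020[0], through_2020[-1], len(through_2020))
```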
Filtering by Columns¶
from msuthemes import load_bigten_data
# Load only specific columns
focused_data = load_bigten_data(
    columns=['INSTNM', 'YEAR', 'ADM_RATE', 'UGDS', 'TUITIONFEE_IN']
)
# Combine filters
msu_recent = load_bigten_data(
    institutions=['Michigan State'],
    years=list(range(2015, 2024)),
    columns=['YEAR', 'ADM_RATE', 'UGDS']
)
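For filters that the loader parameters do not cover, the same kind of selection can be written in plain pandas after loading. The sketch below assumes the loader's filters behave like ordinary row and column selection, which the source does not spell out; `select` is a hypothetical helper and the data is synthetic.

```python
import pandas as pd

# Synthetic stand-in with made-up values; not real College Scorecard data.
demo = pd.DataFrame({
    "INSTNM": ["School A", "School A", "School B", "School B"],
    "YEAR": [2014, 2015, 2014, 2015],
    "ADM_RATE": [0.70, 0.68, 0.55, 0.53],
    "UGDS": [30000, 30500, 28000, 28200],
})


def select(df, institutions=None, years=None, columns=None):
    """Row/column selection in plain pandas (hypothetical, not part of msuthemes)."""
    mask = pd.Series(True, index=df.index)
    if institutions is not None:
        mask &= df["INSTNM"].isin(institutions)
    if years is not None:
        mask &= df["YEAR"].isin(years)
    out = df.loc[mask]
    if columns is not None:
        out = out[list(columns)]
    return out.reset_index(drop=True)


subset = select(demo, institutions=["School A"], years=[2015],
                columns=["YEAR", "ADM_RATE", "UGDS"])
print(subset)
```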
Dataset Variables¶
Institution Identifiers¶
- UNITID: Unique institution identifier
- INSTNM: Institution name
- YEAR: Academic year
Enrollment Metrics¶
- UGDS: Undergraduate enrollment (degree- or certificate-seeking students)
- UG: Number of undergraduates (all undergraduate students)
- GRAD: Number of graduate students
Admissions¶
- ADM_RATE: Admission rate (0-1)
- ADM_RATE_ALL: Overall admission rate
- SATVRMID: SAT verbal midpoint
- SATMTMID: SAT math midpoint
- SATWRMID: SAT writing midpoint
- ACTCMMID: ACT cumulative midpoint
- ACTENMID: ACT English midpoint
- ACTMTMID: ACT math midpoint
- ACTWRMID: ACT writing midpoint
Completion Rates¶
- C150_4: 150% completion rate (6 years for 4-year programs)
- C200_4: 200% completion rate (8 years)
Tuition and Costs¶
- TUITIONFEE_IN: In-state tuition and fees
- TUITIONFEE_OUT: Out-of-state tuition and fees
- TUITIONFEE_PROG: Program-specific tuition
Student Body Demographics¶
- UGDS_WHITE: Share of white undergraduates (stored as a fraction, 0-1)
- UGDS_BLACK: Share of Black undergraduates
- UGDS_HISP: Share of Hispanic undergraduates
- UGDS_ASIAN: Share of Asian undergraduates
- UGDS_AIAN: Share of American Indian/Alaska Native undergraduates
- UGDS_NHPI: Share of Native Hawaiian/Pacific Islander undergraduates
- UGDS_2MOR: Share of undergraduates of two or more races
- UGDS_NRA: Share of non-resident alien undergraduates
- UGDS_UNKN: Share of undergraduates whose race/ethnicity is unknown
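Because the demographic columns partition the undergraduate body, a quick consistency check is to confirm that they add up to roughly 1 in each row. The sketch below assumes the columns hold fractions (the analysis examples on this page multiply them by 100) and uses one synthetic row with made-up values.

```python
import pandas as pd

SHARE_COLS = ["UGDS_WHITE", "UGDS_BLACK", "UGDS_HISP", "UGDS_ASIAN",
              "UGDS_AIAN", "UGDS_NHPI", "UGDS_2MOR", "UGDS_NRA", "UGDS_UNKN"]

# One synthetic row with made-up shares that add up to 1.0.
demo = pd.DataFrame(
    [[0.60, 0.08, 0.07, 0.08, 0.01, 0.01, 0.04, 0.09, 0.02]],
    columns=SHARE_COLS,
)

# min_count=1 keeps the total as NaN when every share in a row is missing,
# instead of reporting a misleading 0.
totals = demo[SHARE_COLS].sum(axis=1, min_count=1)

# Convert fractions to percentages for display.
percent = (demo[SHARE_COLS] * 100).round(1)

print(totals.round(3).tolist())
print(percent.iloc[0].to_dict())
```

Rows whose total is far from 1 are worth inspecting before plotting, since older years may not report every category.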
Financial Aid¶
- PCTPELL: Percentage of students receiving Pell Grants
- PCTFLOAN: Percentage of students receiving federal loans
Institutional Characteristics¶
- CONTROL: Institution control (public/private)
- HIGHDEG: Highest degree awarded
- REGION: Geographic region
- LOCALE: Urban/rural classification
Dataset Information¶
Get Summary Statistics¶
from msuthemes import get_bigten_summary
# Get summary of the dataset
summary = get_bigten_summary()
print(summary)
Get Dataset Metadata¶
from msuthemes import get_dataset_info
# Get information about the dataset
info = get_dataset_info()
print(info['description'])
print(info['source'])
print(info['years'])
print(info['institutions'])
List Available Datasets¶
from msuthemes import list_available_datasets
# List all available datasets (currently just BigTen.csv)
datasets = list_available_datasets()
print(datasets)
Analysis Examples¶
Admission Rate Trends¶
from msuthemes import load_bigten_data, bigten_palette, theme_msu
import matplotlib.pyplot as plt
# Setup
theme_msu()
# Load data
df = load_bigten_data(columns=['INSTNM', 'YEAR', 'ADM_RATE'])
# Filter to recent years and select schools
schools = ['Michigan State', 'Michigan', 'Wisconsin']
df_filtered = df[
    (df['INSTNM'].isin(schools)) &
    (df['YEAR'] >= 2010)
]
# Plot
colors = bigten_palette(schools)
fig, ax = plt.subplots(figsize=(10, 6))
for school, color in zip(schools, colors):
    # Draw one line per school, converting the 0-1 rate to a percentage
    school_data = df_filtered[df_filtered['INSTNM'] == school]
    ax.plot(school_data['YEAR'],
            school_data['ADM_RATE'] * 100,
            color=color,
            linewidth=2.5,
            marker='o',
            label=school)
ax.set_xlabel('Year')
ax.set_ylabel('Admission Rate (%)')
ax.set_title('Admission Rate Trends')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
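Alongside the plot, the same comparison can be read off a wide table with one column per school. This is a sketch on synthetic data with made-up rates; only standard pandas calls are used.

```python
import pandas as pd

# Synthetic long-format data with made-up admission rates.
demo = pd.DataFrame({
    "INSTNM": ["School A", "School A", "School B", "School B"],
    "YEAR": [2010, 2011, 2010, 2011],
    "ADM_RATE": [0.72, 0.70, 0.50, 0.46],
})

# One row per year, one column per institution, values as percentages.
wide = demo.pivot(index="YEAR", columns="INSTNM", values="ADM_RATE") * 100

# Change in percentage points between the first and last year in the table.
change = (wide.iloc[-1] - wide.iloc[0]).round(1)

print(wide.round(1))
print(change)
```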
Enrollment Comparison¶
from msuthemes import load_bigten_data, bigten_palette, theme_msu
import matplotlib.pyplot as plt
# Setup
theme_msu()
# Load most recent year
df = load_bigten_data(years=[2023], columns=['INSTNM', 'UGDS'])
# Sort by enrollment
df_sorted = df.sort_values('UGDS', ascending=True)
# Get colors
colors = bigten_palette(df_sorted['INSTNM'].tolist())
# Create horizontal bar chart
fig, ax = plt.subplots(figsize=(10, 8))
ax.barh(range(len(df_sorted)),
        df_sorted['UGDS'],
        color=colors)
ax.set_yticks(range(len(df_sorted)))
ax.set_yticklabels(df_sorted['INSTNM'])
ax.set_xlabel('Undergraduate Enrollment')
ax.set_title('Big Ten Enrollment (2023)')
plt.tight_layout()
plt.show()
Tuition Analysis¶
from msuthemes import load_bigten_data, msu_qual1, theme_msu
import matplotlib.pyplot as plt
import numpy as np
# Setup
theme_msu()
# Load data
df = load_bigten_data(
    years=[2023],
    columns=['INSTNM', 'TUITIONFEE_IN', 'TUITIONFEE_OUT']
)
# Calculate difference
df['DIFF'] = df['TUITIONFEE_OUT'] - df['TUITIONFEE_IN']
df_sorted = df.sort_values('DIFF', ascending=False)
# Plot
fig, ax = plt.subplots(figsize=(10, 8))
colors = msu_qual1.as_hex()
y = np.arange(len(df_sorted))  # one bar group per institution
height = 0.35                  # thickness of each horizontal bar
ax.barh(y - height/2, df_sorted['TUITIONFEE_IN'],
        height=height, label='In-State', color=colors[0])
ax.barh(y + height/2, df_sorted['TUITIONFEE_OUT'],
        height=height, label='Out-of-State', color=colors[1])
ax.set_yticks(y)
ax.set_yticklabels(df_sorted['INSTNM'])
ax.set_xlabel('Tuition and Fees ($)')
ax.set_title('Tuition Comparison (2023)')
ax.legend()
plt.tight_layout()
plt.show()
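The absolute difference favors schools with high tuition overall, so a ratio can be a useful companion measure. A minimal sketch on synthetic, made-up figures follows; the `RATIO` column is an illustrative addition, not a dataset variable.

```python
import pandas as pd

# Synthetic tuition figures in dollars; not real College Scorecard values.
demo = pd.DataFrame({
    "INSTNM": ["School A", "School B"],
    "TUITIONFEE_IN": [15000, 12000],
    "TUITIONFEE_OUT": [42000, 30000],
})

# Absolute gap, as in the example above.
demo["DIFF"] = demo["TUITIONFEE_OUT"] - demo["TUITIONFEE_IN"]

# Out-of-state price as a multiple of the in-state price.
demo["RATIO"] = (demo["TUITIONFEE_OUT"] / demo["TUITIONFEE_IN"]).round(2)

ranked = demo.sort_values("RATIO", ascending=False).reset_index(drop=True)
print(ranked)
```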
Completion Rate Analysis¶
from msuthemes import load_bigten_data, bigten_palette, theme_msu
import matplotlib.pyplot as plt
# Setup
theme_msu()
# Load data
df = load_bigten_data(
    years=list(range(2015, 2024)),
    columns=['INSTNM', 'YEAR', 'C150_4']
)
# Calculate average by institution
avg_completion = df.groupby('INSTNM')['C150_4'].mean().sort_values(ascending=False)
# Get colors
colors = bigten_palette(avg_completion.index.tolist())
# Plot
fig, ax = plt.subplots(figsize=(10, 8))
ax.barh(range(len(avg_completion)),
        avg_completion.values * 100,
        color=colors)
ax.set_yticks(range(len(avg_completion)))
ax.set_yticklabels(avg_completion.index)
ax.set_xlabel('6-Year Completion Rate (%)')
ax.set_title('Average Completion Rates (2015-2023)')
plt.tight_layout()
plt.show()
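`groupby(...).mean()` silently skips missing values, so an institution's average may rest on fewer years than the title suggests. One way to surface this, sketched on synthetic data with made-up rates, is to report the number of contributing years next to each mean.

```python
import numpy as np
import pandas as pd

# Synthetic data: School B reports a completion rate in only one year.
demo = pd.DataFrame({
    "INSTNM": ["School A"] * 3 + ["School B"] * 3,
    "YEAR": [2015, 2016, 2017] * 2,
    "C150_4": [0.80, 0.82, 0.84, 0.70, np.nan, np.nan],
})

# "count" counts only non-missing values, so it shows how many years
# actually feed into each mean.
summary = (
    demo.groupby("INSTNM")["C150_4"]
    .agg(mean="mean", years_reported="count")
    .sort_values("mean", ascending=False)
)
summary["mean_pct"] = (summary["mean"] * 100).round(1)
print(summary)
```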
Demographics Visualization¶
from msuthemes import load_bigten_data, msu_seq, theme_msu
import matplotlib.pyplot as plt
# Setup
theme_msu()
# Load demographics data
df = load_bigten_data(
    institutions=['Michigan State'],
    columns=['YEAR', 'UGDS_WHITE', 'UGDS_BLACK',
             'UGDS_HISP', 'UGDS_ASIAN']
)
# Filter to recent years
df_recent = df[df['YEAR'] >= 2010].sort_values('YEAR')
# Plot stacked area chart (two bands shown; UGDS_HISP and UGDS_ASIAN
# can be stacked on top in the same way)
colors = msu_seq.as_hex()
fig, ax = plt.subplots(figsize=(10, 6))
ax.fill_between(df_recent['YEAR'],
                0,
                df_recent['UGDS_WHITE'] * 100,
                label='White',
                color=colors[0],
                alpha=0.8)
ax.fill_between(df_recent['YEAR'],
                df_recent['UGDS_WHITE'] * 100,
                (df_recent['UGDS_WHITE'] + df_recent['UGDS_BLACK']) * 100,
                label='Black',
                color=colors[2],
                alpha=0.8)
ax.set_xlabel('Year')
ax.set_ylabel('Percentage of Undergraduates')
ax.set_title('MSU Undergraduate Demographics Over Time')
ax.legend()
plt.tight_layout()
plt.show()
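Stacking more than two groups by hand gets repetitive, because each band's lower edge is the running total of the bands beneath it. A sketch of computing those edges with a cumulative sum follows, using synthetic shares with made-up values.

```python
import pandas as pd

# Synthetic shares (fractions) for two years; values are made up.
demo = pd.DataFrame({
    "YEAR": [2010, 2011],
    "UGDS_WHITE": [0.70, 0.68],
    "UGDS_BLACK": [0.07, 0.07],
    "UGDS_HISP": [0.04, 0.05],
    "UGDS_ASIAN": [0.05, 0.06],
}).set_index("YEAR")

# Upper edge of each band: running total across columns, in percent.
upper = demo.cumsum(axis=1) * 100

# Lower edge: the previous band's upper edge, and 0 for the first band.
lower = upper.shift(1, axis=1).fillna(0)

print(lower.round(1))
print(upper.round(1))
```

Each band can then be drawn in a loop with `ax.fill_between(upper.index, lower[col], upper[col], label=col)`.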
Data Quality Notes¶
- Missing values: Some variables have missing data for certain years
- Consistency: Institution names match those from list_bigten_institutions()
- Updates: Data reflects College Scorecard releases through 2023
- Filtering: Use pandas methods for additional data cleaning
Working with Missing Data¶
from msuthemes import load_bigten_data
import pandas as pd
# Load data
df = load_bigten_data()
# Check for missing values
print(df.isnull().sum())
# Drop rows with missing values in specific column
df_clean = df.dropna(subset=['ADM_RATE'])
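# Fill missing numeric values with each column's mean across all rows
# (this pools every institution and year, so use with care)
df_filled = df.fillna(df.mean(numeric_only=True))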
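Filling with a column-wide mean mixes institutions together, which can distort a school whose values sit far from the conference average. An alternative, sketched here on synthetic data with made-up rates, fills each gap with that institution's own mean and flags the imputed rows; the `_FILLED` and `_IMPUTED` column names are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic data with one gap per school.
demo = pd.DataFrame({
    "INSTNM": ["School A"] * 3 + ["School B"] * 3,
    "YEAR": [2020, 2021, 2022] * 2,
    "ADM_RATE": [0.80, np.nan, 0.70, 0.40, 0.50, np.nan],
})

# transform("mean") returns each institution's mean aligned to the original
# rows, so it can be passed straight to fillna.
school_mean = demo.groupby("INSTNM")["ADM_RATE"].transform("mean")
demo["ADM_RATE_FILLED"] = demo["ADM_RATE"].fillna(school_mean)

# Keep track of which values were imputed rather than observed.
demo["ADM_RATE_IMPUTED"] = demo["ADM_RATE"].isna()

print(demo)
```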
Combining with External Data¶
The dataset can be merged with external data sources:
from msuthemes import load_bigten_data
import pandas as pd
# Load Big Ten data
bigten_df = load_bigten_data()
# Your external data
external_df = pd.read_csv('my_data.csv')
# Merge on institution name and year
merged_df = pd.merge(
    bigten_df,
    external_df,
    left_on=['INSTNM', 'YEAR'],
    right_on=['Institution', 'Year'],
    how='inner'
)
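An inner merge silently drops rows that have no match, for example when the external file spells an institution name differently. A sketch of a left merge that reports unmatched rows follows; both frames are synthetic, and the `Institution`, `Year`, and `Score` columns stand in for whatever your external file contains.

```python
import pandas as pd

# Synthetic stand-ins for the Big Ten data and an external file.
bigten_demo = pd.DataFrame({
    "INSTNM": ["School A", "School A", "School B"],
    "YEAR": [2022, 2023, 2023],
    "UGDS": [30000, 30500, 28000],
})
external_demo = pd.DataFrame({
    "Institution": ["School A", "School B"],
    "Year": [2023, 2023],
    "Score": [1.5, 2.5],
})

merged = pd.merge(
    bigten_demo,
    external_demo,
    left_on=["INSTNM", "YEAR"],
    right_on=["Institution", "Year"],
    how="left",             # keep every Big Ten row
    validate="one_to_one",  # raise if a key pair is duplicated on either side
    indicator=True,         # adds a _merge column: "both" or "left_only"
)

# Rows from the Big Ten data that found no partner in the external data.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["INSTNM", "YEAR"]])
```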
Tips and Best Practices¶
- Filter early: Use function parameters to filter data at load time
- Check data types: Ensure numeric columns are properly typed
- Handle missing data: Check for and handle missing values appropriately
- Verify institution names: Use list_bigten_institutions() for consistency
- Document sources: Note that data comes from College Scorecard when publishing
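For the data-type tip, a minimal sketch of detecting and repairing a numeric column that was read as text is shown below. The data is synthetic, and the `"unknown"` placeholder is only an illustration of a non-numeric entry, not something known to occur in the dataset.

```python
import pandas as pd

# Synthetic column where a text placeholder forced everything to strings.
demo = pd.DataFrame({
    "INSTNM": ["School A", "School B", "School C"],
    "ADM_RATE": ["0.71", "0.55", "unknown"],
})

# A non-numeric dtype here signals that the column needs cleaning.
was_numeric = pd.api.types.is_numeric_dtype(demo["ADM_RATE"])

# errors="coerce" turns anything that cannot be parsed into NaN.
demo["ADM_RATE"] = pd.to_numeric(demo["ADM_RATE"], errors="coerce")

print(was_numeric)
print(demo.dtypes)
print(demo["ADM_RATE"].isna().sum())
```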
See Also¶
- Big Ten - Big Ten colors and palettes
- Gallery - Big Ten visualization examples
- API Reference - Data module documentation