data

The data module provides access to the Big Ten dataset from the College Scorecard.

Module Overview

This module provides:

  • Big Ten College Scorecard data (1996-2023)
  • Flexible data loading with filtering
  • Dataset summary and metadata functions
  • 38 variables covering enrollment, admissions, completion, tuition, and demographics

Functions

load_bigten_data(institutions=None, years=None, columns=None)

Load BigTen institutional historical dataset (1996-2023).

This dataset contains historical data from the College Scorecard for all 18 Big Ten Conference institutions from 1996 to 2023.

Args:

  • institutions: Filter for specific institutions (default: all 18)
  • years: Filter for specific years (default: all years 1996-2023)
  • columns: Select specific columns (default: all columns)

Returns: pandas DataFrame with BigTen institutional data

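Raises:

  • FileNotFoundError: If BigTen.csv is not found
  • IOError: If BigTen.csv exists but cannot be read
  • ValueError: If none of the requested institutions is valid, or if a requested column does not exist

Invalid institutions listed alongside valid ones, and years outside the dataset range, trigger a warning and are skipped rather than raising.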

Dataset Information:

  • 504 rows (18 institutions × 28 years)
  • 38 columns including:
      • Institutional characteristics (public/private, land-grant, AAU)
      • Enrollment data (total, by gender, by race/ethnicity)
      • Admission rates
      • Completion rates
      • Tuition and fees
      • Cost of attendance
      • Pell grant recipients

Examples:

>>> # Load all data
>>> df = load_bigten_data()
>>> print(df.shape)
(504, 38)

>>> # Filter for specific institutions
>>> df_msu_um = load_bigten_data(institutions=['MSU', 'Michigan'])
>>> print(df_msu_um['name'].unique())
['MSU', 'Michigan']

>>> # Filter for specific years
>>> df_recent = load_bigten_data(years=[2020, 2021, 2022, 2023])
>>> print(len(df_recent))
72

>>> # Select specific columns
>>> df_small = load_bigten_data(
...     institutions=['MSU'],
...     columns=['name', 'entry_term', 'UGDS', 'ADM_RATE']
... )

>>> # MSU enrollment over time
>>> msu_data = load_bigten_data(institutions=['MSU'])
>>> print(msu_data[['entry_term', 'UGDS']].head())
Source code in msuthemes/data/data_loader.py
def load_bigten_data(
    institutions: Optional[List[str]] = None,
    years: Optional[List[int]] = None,
    columns: Optional[List[str]] = None
) -> pd.DataFrame:
    """Load BigTen institutional historical dataset (1996-2023).

    This dataset contains historical data from the College Scorecard for
    all 18 Big Ten Conference institutions from 1996 to 2023.

    Args:
        institutions: Filter for specific institutions (default: all 18)
        years: Filter for specific years (default: all years 1996-2023)
        columns: Select specific columns (default: all columns)

    Returns:
        pandas DataFrame with BigTen institutional data

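    Raises:
        FileNotFoundError: If BigTen.csv is not found
        IOError: If BigTen.csv exists but cannot be read
        ValueError: If no valid institutions are specified, or if a
            requested column does not exist

    Warns:
        UserWarning: If some institutions are invalid or some years fall
            outside the dataset range; those entries are skipped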

    Dataset Information:
        - 504 rows (18 institutions × 28 years)
        - 38 columns including:
          * Institutional characteristics (public/private, land-grant, AAU)
          * Enrollment data (total, by gender, by race/ethnicity)
          * Admission rates
          * Completion rates
          * Tuition and fees
          * Cost of attendance
          * Pell grant recipients

    Examples:
        >>> # Load all data
        >>> df = load_bigten_data()
        >>> print(df.shape)
        (504, 38)

        >>> # Filter for specific institutions
        >>> df_msu_um = load_bigten_data(institutions=['MSU', 'Michigan'])
        >>> print(df_msu_um['name'].unique())
        ['MSU', 'Michigan']

        >>> # Filter for specific years
        >>> df_recent = load_bigten_data(years=[2020, 2021, 2022, 2023])
        >>> print(len(df_recent))
        72

        >>> # Select specific columns
        >>> df_small = load_bigten_data(
        ...     institutions=['MSU'],
        ...     columns=['name', 'entry_term', 'UGDS', 'ADM_RATE']
        ... )

        >>> # MSU enrollment over time
        >>> msu_data = load_bigten_data(institutions=['MSU'])
        >>> print(msu_data[['entry_term', 'UGDS']].head())
    """
    # Get data file path
    data_path = get_data_path() / "BigTen.csv"

    if not data_path.exists():
        raise FileNotFoundError(
            f"BigTen dataset not found at {data_path}. "
            "Please ensure the package was installed correctly."
        )

    # Load the dataset
    try:
        df = pd.read_csv(data_path)
    except Exception as e:
        # Chain the original exception so the underlying cause stays in the traceback
        raise IOError(f"Error loading BigTen dataset: {e}") from e

    # Validate and filter institutions
    if institutions is not None:
        # Normalize institution names
        from msuthemes.bigten import normalize_institution_name

        normalized_institutions = []
        invalid_institutions = []
        for inst in institutions:
            try:
                normalized = normalize_institution_name(inst)
                # Handle USC/USoCal difference
                if normalized == "USC":
                    normalized = "USoCal"
                normalized_institutions.append(normalized)
            except ValueError as e:
                invalid_institutions.append((inst, str(e)))

        # If no valid institutions, raise error without warning
        if not normalized_institutions:
            if len(invalid_institutions) == 1:
                raise ValueError(invalid_institutions[0][1])
            else:
                raise ValueError("No valid institutions specified")

        # If some valid and some invalid, warn about invalid ones
        if invalid_institutions:
            for inst, error in invalid_institutions:
                warnings.warn(f"Skipping invalid institution '{inst}': {error}")

        df = df[df['name'].isin(normalized_institutions)]

    # Filter years
    if years is not None:
        years = [int(y) for y in years]  # Ensure integers
        min_year, max_year = df['entry_term'].min(), df['entry_term'].max()

        # Validate years
        invalid_years = [y for y in years if y < min_year or y > max_year]
        if invalid_years:
            warnings.warn(
                f"Years {invalid_years} are outside the dataset range "
                f"({min_year:.0f}-{max_year:.0f}). These will be ignored."
            )

        df = df[df['entry_term'].isin(years)]

    # Select columns
    if columns is not None:
        # Validate columns
        invalid_cols = [c for c in columns if c not in df.columns]
        if invalid_cols:
            raise ValueError(
                f"Invalid columns: {invalid_cols}. "
                f"Available columns: {list(df.columns)}"
            )
        df = df[columns]

    # Reset index
    df = df.reset_index(drop=True)

    return df
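
The institution handling in the source above follows a warn-or-raise pattern: unknown names are skipped with a warning as long as at least one name is valid, and a ValueError is raised only when nothing valid remains. The following is a minimal standalone sketch of that pattern, not the package's code. `ALIASES`, `normalize`, and `resolve_institutions` are hypothetical stand-ins (the real alias table behind `normalize_institution_name` is not shown in this page), and the single-invalid-name special case is left out for brevity.

```python
import warnings
from typing import List

# Hypothetical alias table; the real mapping lives in msuthemes.bigten.
ALIASES = {"msu": "MSU", "michigan state": "MSU", "michigan": "Michigan"}


def normalize(name: str) -> str:
    """Stand-in normalizer: return the canonical name or raise ValueError."""
    try:
        return ALIASES[name.strip().lower()]
    except KeyError:
        raise ValueError(f"Unknown institution: {name!r}") from None


def resolve_institutions(requested: List[str]) -> List[str]:
    """Keep valid names, warn about invalid ones, raise if none are valid."""
    valid, invalid = [], []
    for name in requested:
        try:
            valid.append(normalize(name))
        except ValueError as exc:
            invalid.append((name, str(exc)))

    if not valid:
        # Nothing usable was requested, so fail loudly instead of warning.
        raise ValueError("No valid institutions specified")

    for name, error in invalid:
        warnings.warn(f"Skipping invalid institution '{name}': {error}")
    return valid
```

In this sketch, a request such as `['Michigan State', 'Nowhere U']` resolves to the one valid name and emits one warning, while `['Nowhere U']` alone raises.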

get_bigten_summary()

Get summary statistics for the BigTen dataset.

Returns: DataFrame with summary statistics by institution

Examples:

>>> summary = get_bigten_summary()
>>> print(summary.head())

Source code in msuthemes/data/data_loader.py
def get_bigten_summary() -> pd.DataFrame:
    """Get summary statistics for the BigTen dataset.

    Returns:
        DataFrame with summary statistics by institution

    Examples:
        >>> summary = get_bigten_summary()
        >>> print(summary.head())
    """
    df = load_bigten_data()

    summary = df.groupby('name').agg({
        'entry_term': ['min', 'max', 'count'],
        'UGDS': ['mean', 'min', 'max'],
        'ADM_RATE': 'mean',
        'C150_4': 'mean',
    }).round(4)

    summary.columns = ['_'.join(col).strip() for col in summary.columns.values]

    return summary.reset_index()
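
The column-flattening step in the source above is easier to follow on a small frame. The sketch below uses made-up values and only a subset of the aggregations. It shows how the two-level column labels produced by `groupby(...).agg(...)` are joined into flat names such as `entry_term_min`.

```python
import pandas as pd

# Tiny synthetic frame; values are made up for illustration only.
df = pd.DataFrame({
    "name": ["A", "A", "B", "B"],
    "entry_term": [2022, 2023, 2022, 2023],
    "UGDS": [100.0, 120.0, 200.0, 220.0],
    "ADM_RATE": [0.5, 0.7, 0.2, 0.4],
})

summary = df.groupby("name").agg({
    "entry_term": ["min", "max", "count"],
    "UGDS": ["mean"],
    "ADM_RATE": "mean",
})

# Aggregating with lists yields two-level column labels such as
# ('entry_term', 'min'); joining the levels gives flat names.
summary.columns = ["_".join(col) for col in summary.columns]
summary = summary.reset_index()

print(summary.columns.tolist())
```

After flattening, each column name is the source column followed by the aggregation, so the result of `get_bigten_summary()` can be indexed with names like `UGDS_mean` and `ADM_RATE_mean`.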

get_dataset_info()

Get information about the BigTen dataset.

Returns: Dictionary with dataset information

Examples:

>>> info = get_dataset_info()
>>> print(info['n_institutions'])
18
>>> print(info['years'])
(1996, 2023)

Source code in msuthemes/data/data_loader.py
def get_dataset_info() -> dict:
    """Get information about the BigTen dataset.

    Returns:
        Dictionary with dataset information

    Examples:
        >>> info = get_dataset_info()
        >>> print(info['n_institutions'])
        18
        >>> print(info['years'])
        (1996, 2023)
    """
    try:
        df = load_bigten_data()

        return {
            'n_rows': len(df),
            'n_columns': len(df.columns),
            'n_institutions': df['name'].nunique(),
            'institutions': sorted(df['name'].unique().tolist()),
            'years': (int(df['entry_term'].min()), int(df['entry_term'].max())),
            'n_years': df['entry_term'].nunique(),
            'columns': df.columns.tolist(),
        }
    except FileNotFoundError:
        return {
            'error': 'Dataset not found',
            'message': 'BigTen.csv is not available in the data directory'
        }
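
Because the source above returns an error dictionary instead of raising when the CSV is missing, callers may want to check for the `error` key before reading the other fields. The sketch below shows one way to do that. `describe_dataset` is a hypothetical helper, and the two dictionaries are hand-written examples shaped like the two return values.

```python
from typing import Any, Dict


def describe_dataset(info: Dict[str, Any]) -> str:
    """Turn a get_dataset_info()-style dictionary into a one-line summary."""
    if "error" in info:
        # The missing-file case is reported as data, not as an exception.
        return f"{info['error']}: {info['message']}"
    first, last = info["years"]
    return (
        f"{info['n_rows']} rows, {info['n_columns']} columns, "
        f"{info['n_institutions']} institutions, {first}-{last}"
    )


# Example dictionaries shaped like the two return values in the source above.
ok = {"n_rows": 504, "n_columns": 38, "n_institutions": 18, "years": (1996, 2023)}
missing = {
    "error": "Dataset not found",
    "message": "BigTen.csv is not available in the data directory",
}

print(describe_dataset(ok))
print(describe_dataset(missing))
```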

list_available_datasets()

List all available datasets.

Returns: List of dataset filenames

Examples:

>>> datasets = list_available_datasets()
>>> print(datasets)
['BigTen.csv']

Source code in msuthemes/data/data_loader.py
def list_available_datasets() -> List[str]:
    """List all available datasets.

    Returns:
        List of dataset filenames

    Examples:
        >>> datasets = list_available_datasets()
        >>> print(datasets)
        ['BigTen.csv']
    """
    data_dir = get_data_path()
    if not data_dir.exists():
        return []

    return sorted([
        f.name for f in data_dir.glob("*.csv")
    ])
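
The same directory-listing logic can be tried against a throwaway directory. This is a sketch under the assumption that only the glob-and-sort behavior matters. `list_csv_files` is a hypothetical stand-in that takes the directory as an argument rather than calling `get_data_path()`.

```python
import tempfile
from pathlib import Path
from typing import List


def list_csv_files(data_dir: Path) -> List[str]:
    """Return sorted CSV filenames in data_dir, or [] if it does not exist."""
    if not data_dir.exists():
        return []
    return sorted(f.name for f in data_dir.glob("*.csv"))


# Exercise the helper against a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for filename in ("BigTen.csv", "Another.csv", "notes.txt"):
        (root / filename).write_text("placeholder\n")
    found = list_csv_files(root)
    absent = list_csv_files(root / "does-not-exist")

print(found)
print(absent)
```

Non-CSV files are ignored, and a missing directory yields an empty list rather than an exception.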

Usage Examples

Basic Data Loading

from msuthemes import load_bigten_data

# Load complete dataset
df = load_bigten_data()
print(df.shape)  # (504, 38)
print(df.columns)

Filtering by Institution

from msuthemes import load_bigten_data

# Load data for single institution
msu_data = load_bigten_data(institutions=['Michigan State'])

# Load data for multiple institutions
rivalry = load_bigten_data(
    institutions=['Michigan State', 'Michigan', 'Ohio State']
)

# Using abbreviations
msu_data = load_bigten_data(institutions=['MSU'])

Filtering by Year

from msuthemes import load_bigten_data

# Load specific years
recent = load_bigten_data(years=[2020, 2021, 2022, 2023])

# Load year range
decade = load_bigten_data(years=list(range(2010, 2020)))

# Load single year
df_2023 = load_bigten_data(years=[2023])
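
Years outside the dataset range are not an error: per the source of `load_bigten_data`, they trigger a warning and are ignored, so a request made up entirely of out-of-range years returns an empty DataFrame. The sketch below reproduces that behavior on a synthetic frame. `filter_years` is a hypothetical stand-in for the year-filtering step, not the package function.

```python
import warnings
from typing import List

import pandas as pd

# Synthetic stand-in for the real dataset (two institutions, 2021-2023).
frame = pd.DataFrame({
    "name": ["A", "B"] * 3,
    "entry_term": [2021, 2021, 2022, 2022, 2023, 2023],
})


def filter_years(df: pd.DataFrame, years: List[int]) -> pd.DataFrame:
    """Keep rows whose entry_term is in years; warn about out-of-range years."""
    years = [int(y) for y in years]
    lo, hi = df["entry_term"].min(), df["entry_term"].max()
    out_of_range = [y for y in years if y < lo or y > hi]
    if out_of_range:
        warnings.warn(f"Years {out_of_range} are outside {lo}-{hi} and are ignored.")
    return df[df["entry_term"].isin(years)].reset_index(drop=True)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    partial = filter_years(frame, [2023, 2030])  # one valid year, one not
    empty = filter_years(frame, [1990])          # no valid years at all

print(len(partial), len(empty), len(caught))
```

Checking `df.empty` after a year-filtered load is a cheap guard against a mistyped range.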

Filtering by Columns

from msuthemes import load_bigten_data

# Load specific columns
df = load_bigten_data(
    columns=['name', 'entry_term', 'ADM_RATE', 'UGDS']
)

# Combine filters
msu_recent = load_bigten_data(
    institutions=['Michigan State'],
    years=list(range(2015, 2024)),
    columns=['entry_term', 'ADM_RATE', 'C150_4']
)

Getting Dataset Summary

from msuthemes import get_bigten_summary

# Get summary statistics
summary = get_bigten_summary()
print(summary)

Getting Dataset Information

from msuthemes import get_dataset_info

# Get metadata
info = get_dataset_info()
print(info['n_rows'], info['n_columns'])
print(info['n_institutions'])
print(info['years'])
print(info['institutions'])

Listing Available Datasets

from msuthemes import list_available_datasets

# List all datasets
datasets = list_available_datasets()
print(datasets)

Complete Analysis Example

from msuthemes import (
    load_bigten_data,
    bigten_palette,
    theme_msu
)
from msuthemes.bigten import normalize_institution_name
import matplotlib.pyplot as plt

# Apply theme
theme_msu(grid=True)

# Load and filter data
schools = ['Michigan State', 'Michigan', 'Wisconsin']
df = load_bigten_data(
    institutions=schools,
    years=list(range(2010, 2024)),
    columns=['name', 'entry_term', 'ADM_RATE']
)

# Get colors
colors = bigten_palette(schools)

# Plot
fig, ax = plt.subplots(figsize=(10, 6))

for school, color in zip(schools, colors):
    # The 'name' column holds normalized names (e.g. 'MSU'), so normalize before matching
    school_data = df[df['name'] == normalize_institution_name(school)]
    ax.plot(school_data['entry_term'],
            school_data['ADM_RATE'] * 100,
            color=color,
            linewidth=2.5,
            marker='o',
            label=school)

ax.set_xlabel('Year')
ax.set_ylabel('Admission Rate (%)')
ax.set_title('Admission Rate Trends')
ax.legend()

plt.tight_layout()
plt.show()
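
For a tabular comparison alongside the plot, the long-format rows can be pivoted so that each institution becomes a column. The sketch below uses a synthetic frame with invented rates and assumes the `name` and `entry_term` columns that the loader filters on. It is an illustration of the reshaping step, not part of the package.

```python
import pandas as pd

# Synthetic long-format rows; the rates are invented for illustration.
long_df = pd.DataFrame({
    "name": ["A", "B", "A", "B"],
    "entry_term": [2022, 2022, 2023, 2023],
    "ADM_RATE": [0.50, 0.20, 0.45, 0.25],
})

# One row per year, one column per institution.
wide = long_df.pivot(index="entry_term", columns="name", values="ADM_RATE")

# Year-over-year change in percentage points for each institution.
change_pp = (wide.diff() * 100).round(1)

print(wide)
print(change_pp)
```

The first row of `change_pp` is NaN because there is no earlier year to compare against.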

Dataset Variables

Institution Identifiers

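  • UNITID: Unique institution ID
  • INSTNM: Institution name
  • YEAR: Academic year
  • name: Normalized institution name, matched by the institutions filter (for example 'MSU')
  • entry_term: Year, matched by the years filter (1996-2023)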

Enrollment

  • UGDS: Total undergraduate enrollment
  • UG: Number of undergraduates
  • GRAD: Number of graduates

Admissions

  • ADM_RATE: Admission rate (0-1)
  • ADM_RATE_ALL: Overall admission rate
  • SATVRMID, SATMTMID, SATWRMID: SAT score midpoints
  • ACTCMMID, ACTENMID, ACTMTMID, ACTWRMID: ACT score midpoints

Completion

  • C150_4: 6-year completion rate (0-1)
  • C200_4: 8-year completion rate (0-1)

Costs

  • TUITIONFEE_IN: In-state tuition and fees
  • TUITIONFEE_OUT: Out-of-state tuition and fees
  • TUITIONFEE_PROG: Program tuition

Demographics

  • UGDS_WHITE: % White undergraduates
  • UGDS_BLACK: % Black undergraduates
  • UGDS_HISP: % Hispanic undergraduates
  • UGDS_ASIAN: % Asian undergraduates
  • UGDS_AIAN: % American Indian/Alaska Native
  • UGDS_NHPI: % Native Hawaiian/Pacific Islander
  • UGDS_2MOR: % Two or more races
  • UGDS_NRA: % Non-resident aliens
  • UGDS_UNKN: % Race/ethnicity unknown

Financial Aid

  • PCTPELL: % receiving Pell Grants
  • PCTFLOAN: % receiving federal loans

Institution Characteristics

  • CONTROL: Public/private control
  • HIGHDEG: Highest degree awarded
  • REGION: Geographic region
  • LOCALE: Urban/rural classification

Data Quality Notes

  • Missing values exist for some variables/years
  • Institution names match list_bigten_institutions()
  • Data sourced from U.S. Department of Education College Scorecard
  • Updated through 2023

Working with Missing Data

from msuthemes import load_bigten_data
import pandas as pd

# Load data
df = load_bigten_data()

# Check missing values
print(df.isnull().sum())

# Drop rows with missing values in specific column
df_clean = df.dropna(subset=['ADM_RATE'])

# Fill missing values with mean
df_filled = df.fillna(df.mean(numeric_only=True))
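
Filling with the overall mean mixes values across institutions, which may not be what an analysis needs. One possible alternative is to fill each gap with the mean of its own institution. The sketch below compares the two on a synthetic frame with invented rates. The `name` column is assumed to identify the institution, as in the loader source.

```python
import pandas as pd

# Synthetic frame with one missing admission rate per institution.
df = pd.DataFrame({
    "name": ["A", "A", "A", "B", "B", "B"],
    "entry_term": [2021, 2022, 2023] * 2,
    "ADM_RATE": [0.8, float("nan"), 0.6, 0.2, 0.4, float("nan")],
})

# Global fill: every gap gets the mean over all institutions.
global_fill = df["ADM_RATE"].fillna(df["ADM_RATE"].mean())

# Group-wise fill: each gap gets the mean of its own institution.
group_means = df.groupby("name")["ADM_RATE"].transform("mean")
group_fill = df["ADM_RATE"].fillna(group_means)

print(pd.DataFrame({"global": global_fill, "by_institution": group_fill}))
```

Either approach fabricates values, so for trend plots it is often safer to drop the missing rows instead.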

See Also