stanbkt.utils#
Functions
- accuracy – Compute accuracy by thresholding predicted probabilities at 0.5.
- auc – Compute the area under the ROC curve (AUC).
- clear_stan_cache – Clear the compiled Stan model cache.
- compile_stan_model – Compile or reuse a cached Stan executable and return a CmdStan model.
- Format input data for BKT model fitting.
- get_cache_root – Return the root cache directory for all compiled Stan models.
- get_stan_model_cache_dir – Return the cache directory for a compiled Stan executable.
- gq_to_draws – Convert raw CmdStanGQ outputs into long-form draw DataFrames with remapped IDs.
- Check if the current system is Windows.
- list_cached_models – List all cached Stan model directories.
- Package a directory of model artifacts into a .stanbktmod archive.
- posterior_summary – Summarise posterior predictions into per-observation statistics.
- rmse – Compute root mean squared error between predicted probabilities and binary labels.
- setup_cmdstanpy – Set up CmdStanPy by checking for CmdStan installation and setting the path if necessary.
- sim_grouped_BKT – Simulate student problem responses under grouped BKT model.
- sim_simple_BKT – Simulate student problem responses under simple BKT model.
- summarize_draws – Summarize posterior draws for each parameter.
- summary_parameter_names – Return parameter names to include in fit summaries.
- Extract a StanBKT model archive into a target directory.
- validate_data – Validate input data for BKT model fitting.
Classes
- ColumnNames – Enumeration of standard column names for BKT data.
- KCData – Structured data container for a single knowledge component.
- VerbosityLevel – Enumeration of verbosity levels for logging output.
API Details
Utilities for StanBKT, including model compilation, data handling, verbosity control, and simulation functions.
- class stanbkt.utils.ColumnNames(*values)#
Bases: StrEnum
Enumeration of standard column names for BKT data.
- Variables:
STUDENT_ID (str) – Unique student identifier column name.
PROBLEM_ID (str) – Unique problem identifier column name.
CORRECTNESS (str) – Binary correctness column (1=correct, 0=incorrect).
ORDER (str) – The order in which the students attempted the problems.
KC_ID (str) – Knowledge component identifier column name.
GROUP (str) – Student or problem group identifier (optional).
- CORRECTNESS = 'correct'#
- GROUP = 'group_id'#
- KC_ID = 'kc_id'#
- ORDER = 'timestamp'#
- PROBLEM_ID = 'problem_id'#
- STUDENT_ID = 'student_id'#
- static apply_default_mapping(col_mapping=None)#
Apply default mapping to fill missing column name mappings.
For any standard column not in the provided mapping, uses the default (column name maps to itself).
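The fill-in behaviour described above can be pictured with a small stand-alone sketch. The `ColumnNames` class below is a local stand-in that copies the documented members, and `fill_default_mapping` is a hypothetical name; returning a plain dict keyed by the standard column strings is an assumption, not the documented return type.

```python
from enum import Enum


class ColumnNames(str, Enum):
    """Local stand-in mirroring the documented members (not the library class)."""

    STUDENT_ID = "student_id"
    PROBLEM_ID = "problem_id"
    CORRECTNESS = "correct"
    ORDER = "timestamp"
    KC_ID = "kc_id"
    GROUP = "group_id"


def fill_default_mapping(col_mapping=None):
    """Hypothetical helper: return a mapping that covers every standard column."""
    filled = {}
    for key, value in (col_mapping or {}).items():
        # Accept either enum members or plain strings as keys.
        name = key.value if isinstance(key, ColumnNames) else str(key)
        filled[name] = value
    for col in ColumnNames:
        # Any standard column the caller left out maps to itself.
        filled.setdefault(col.value, col.value)
    return filled


mapping = fill_default_mapping({ColumnNames.CORRECTNESS: "is_right", "kc_id": "skill"})
```

Here `mapping` keeps the two caller-supplied names and maps the remaining four standard columns to themselves.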
- class stanbkt.utils.KCData(correctness, student_inter_dict, lengths, student_ids, problem_ids, groups=None, group_2_index=None)#
Bases: object
Structured data container for a single knowledge component.
Stores preprocessed interaction data for a KC, including correctness matrix, student/problem identifiers, and optional group assignments.
- Variables:
correctness (np.ndarray) – Correctness matrix of shape (num_students, num_problems).
student_inter_dict (dict[str, StudentInteraction]) – Mapping of student IDs to their interaction sequences, used to preserve original problem IDs and interaction counts.
lengths (npt.NDArray[np.int32]) – Vector of interaction sequence lengths for each student.
student_ids (list[str]) – Student identifiers matching rows of correctness matrix.
problem_ids (list[str]) – Problem identifiers matching columns of correctness matrix.
groups (Optional[np.ndarray], default None) – Optional group array (e.g., student group assignments or problem difficulty groups).
group_2_index (Optional[dict[str, int]], default None) – Optional mapping from group ID to index in groups array.
- Parameters:
- correctness: ndarray#
- class stanbkt.utils.VerbosityLevel(*values)#
Bases: IntEnum
Enumeration of verbosity levels for logging output.
- Variables:
- DEBUG = 3#
- INFO = 2#
- WARN = 1#
- stanbkt.utils.accuracy(correctness, predictions)#
Compute accuracy by thresholding predicted probabilities at 0.5.
- Parameters:
correctness (TypeAliasType) – Flattened binary correctness labels (0 or 1).
predictions (TypeAliasType) – Flattened predicted probabilities in [0, 1].
- Returns:
Proportion of correctly classified observations.
- Return type:
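Notes
The thresholding step can be pictured with a minimal pure-Python sketch. The name `accuracy_sketch` is hypothetical, and treating a probability of exactly 0.5 as a predicted 1 is an assumption, since the tie rule is not stated above.

```python
def accuracy_sketch(correctness, predictions, threshold=0.5):
    """Share of observations whose thresholded prediction matches the label."""
    if len(correctness) != len(predictions):
        raise ValueError("correctness and predictions must have the same length")
    # Assumption: probabilities at or above the threshold count as a predicted 1.
    hits = sum(int(p >= threshold) == int(y) for y, p in zip(correctness, predictions))
    return hits / len(correctness)


labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.4, 0.7]
score = accuracy_sketch(labels, probs)  # the third observation is misclassified
```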
- stanbkt.utils.auc(correctness, predictions)#
Compute the area under the ROC curve (AUC).
- Parameters:
correctness (TypeAliasType) – Flattened binary correctness labels (0 or 1).
predictions (TypeAliasType) – Flattened predicted probabilities in [0, 1].
- Returns:
AUC score in [0, 1].
- Return type:
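Notes
AUC equals the probability that a randomly chosen positive observation receives a higher predicted probability than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by direct pair counting; `auc_sketch` is a hypothetical name, and the library may well use a different (faster) algorithm.

```python
def auc_sketch(correctness, predictions):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    positives = [p for y, p in zip(correctness, predictions) if y == 1]
    negatives = [p for y, p in zip(correctness, predictions) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC is undefined unless both classes are present")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0  # positive ranked above negative
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


labels = [1, 0, 1, 0]
probs = [0.8, 0.3, 0.6, 0.7]
score = auc_sketch(labels, probs)  # 3 of the 4 positive/negative pairs are ordered correctly
```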
- stanbkt.utils.clear_stan_cache(stan_file=None, cpp_options=None, stanc_options=None, print_fn=None)#
Clear the compiled Stan model cache.
- Parameters:
stan_file (str | PathLike[str] | None) – If provided, only clear the cache for this specific Stan file with the given compile options. If None, clear the entire cache for all models.
cpp_options (dict[str, Any] | None) – C++ compiler options. Only used if stan_file is provided.
stanc_options (dict[str, Any] | None) – Stan compiler options. Only used if stan_file is provided.
print_fn (Optional[Callable]) – Optional function to print status messages.
- Returns:
Number of cache directories removed.
- Return type:
Examples
Clear the entire cache:
>>> clear_stan_cache()
Clear cache for a specific model:
>>> clear_stan_cache("path/to/model.stan")
Clear cache for a specific model with specific compile options:
>>> clear_stan_cache("path/to/model.stan", cpp_options={"STAN_THREADS": True})
- stanbkt.utils.compile_stan_model(stan_file, cpp_options=None, stanc_options=None, print_fn=None)#
Compile or reuse a cached Stan executable and return a CmdStan model.
- Parameters:
- Returns:
Model instance backed by a cached compiled executable.
- Return type:
- Raises:
RuntimeError – If CmdStanPy finishes compilation without reporting an executable path.
Notes
Compilation is performed inside a temporary directory so the exe is never written into the installed package tree (e.g. site-packages inside a venv). Only the compiled executable is stored in the platform cache directory; Stan source files are not cached.
CmdStanPy mutates the stanc_options / cpp_options dicts it receives (adding include-paths based on the Stan file location). This function always passes shallow copies so the caller’s dicts are never modified.
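The copy-before-passing pattern can be illustrated without CmdStanPy. In this sketch `compile_with_side_effects` is a hypothetical stand-in for a callee that mutates its arguments, `safe_compile` is a hypothetical wrapper, and the path and option values are placeholders.

```python
def compile_with_side_effects(stanc_options, cpp_options):
    """Hypothetical stand-in for a callee that mutates the dicts it is given."""
    stanc_options["include-paths"] = ["/placeholder/stan/dir"]
    return stanc_options, cpp_options


def safe_compile(stanc_options=None, cpp_options=None):
    """Pass shallow copies so keys added by the callee never reach the caller."""
    return compile_with_side_effects(dict(stanc_options or {}), dict(cpp_options or {}))


caller_stanc = {"warn-pedantic": True}
caller_cpp = {"STAN_THREADS": True}
used_stanc, used_cpp = safe_compile(stanc_options=caller_stanc, cpp_options=caller_cpp)
```

A shallow copy is enough here because the callee only adds a top-level key; nested values would still be shared between the copy and the original.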
- stanbkt.utils.get_cache_root()#
Return the root cache directory for all compiled Stan models.
- Returns:
Platform-specific cache root directory for StanBKT compiled models.
- Return type:
- stanbkt.utils.get_stan_model_cache_dir(stan_file, cpp_options=None, stanc_options=None)#
Return the cache directory for a compiled Stan executable.
- Parameters:
- Returns:
Platform-specific cache directory for the executable associated with the given source and compile configuration.
- Return type:
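Notes
How the directory is derived is not spelled out here. One common way to associate a cache entry with both a source and its compile configuration is to hash them together, as in the sketch below. Everything in it is an assumption made for illustration: the name `cache_dir_sketch`, hashing the source text with SHA-256, and the 16-character directory name.

```python
import hashlib
import json
from pathlib import Path


def cache_dir_sketch(cache_root, stan_source, cpp_options=None, stanc_options=None):
    """Hypothetical: derive a directory name from the source text and compile options."""
    payload = json.dumps(
        {"source": stan_source, "cpp": cpp_options or {}, "stanc": stanc_options or {}},
        sort_keys=True,  # key order must not change the digest
        default=str,  # fall back to str() for values JSON cannot encode
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return Path(cache_root) / digest


source = "parameters { real<lower=0, upper=1> theta; }"
plain = cache_dir_sketch("cache", source)
threaded = cache_dir_sketch("cache", source, cpp_options={"STAN_THREADS": True})
```

Under this scheme the same source compiled with different options lands in a different directory, so one executable never overwrites another.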
- stanbkt.utils.gq_to_draws(stan_output, data, col_mapping=None, print_fn=None)#
Convert raw CmdStanGQ outputs into long-form draw DataFrames with remapped IDs.
- Parameters:
stan_output (dict[str, CmdStanGQ]) – Mapping from KC ID to raw CmdStanGQ objects, as returned by predict_posterior_stan or predict_smoothed_posterior_stan.
data (DataFrame) – The original student interaction data used for the Stan GQ call. Required to remap Stan integer indices back to actual student/problem IDs.
col_mapping (Union[Mapping[ColumnNames, str], Mapping[str, str], Mapping[ColumnNames | str, str], None]) – Column name mapping. If None, the standard ColumnNames defaults are used.
print_fn (Optional[Callable[..., None]]) – Optional logging callable. Receives a message string as the first positional argument. Defaults to None (no logging).
- Returns:
Mapping from KC ID to draw-level DataFrames suitable for posterior_summary.
- Return type:
- stanbkt.utils.list_cached_models(print_fn=None)#
List all cached Stan model directories.
- Parameters:
print_fn (Optional[Callable]) – Optional function to print status messages.
- Returns:
List of cache directory paths for compiled models.
- Return type:
Examples
>>> cached = list_cached_models()
>>> print(f"Found {len(cached)} cached models")
- stanbkt.utils.posterior_summary(draws, col_mapping=None, quantiles=[0.025, 0.975], data=None)#
Summarise posterior predictions into per-observation statistics.
- Parameters:
draws (Union[dict[str, DataFrame], dict[str, CmdStanGQ]]) – Either draw-level DataFrames (as returned by predict_posterior_draws or predict_smoothed_posterior_draws), or raw CmdStanGQ objects (as returned by predict_posterior_stan or predict_smoothed_posterior_stan). When passing CmdStanGQ objects, data must also be supplied.
col_mapping (Union[Mapping[ColumnNames, str], Mapping[str, str], Mapping[ColumnNames | str, str], None]) – Column name mapping. If None, the standard ColumnNames defaults are used.
quantiles (list[float]) – Credible-interval quantiles to include in the summary. Each value must be in [0, 1].
data (Optional[DataFrame]) – Original student interaction data. Required when draws contains CmdStanGQ objects; ignored otherwise.
- Returns:
Long-form summary with mean, std, median, and the requested quantiles for pKnow and pCorrectness.
- Return type:
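Notes
For a single observation, the summary reduces a list of posterior draws to a handful of statistics. The sketch below shows that reduction with the standard library only. The names `quantile` and `summarise_draws_sketch` are hypothetical, and the use of the sample standard deviation and linear-interpolation quantiles are assumptions about details not stated above.

```python
import statistics


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list, with q in [0, 1]."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("quantiles must lie in [0, 1]")
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1.0 - weight) + sorted_values[upper] * weight


def summarise_draws_sketch(draws, quantiles=(0.025, 0.975)):
    """Reduce the draws for one observation to mean, std, median and quantiles."""
    ordered = sorted(draws)
    summary = {
        "mean": statistics.fmean(ordered),
        "std": statistics.stdev(ordered),  # assumption: sample standard deviation
        "median": statistics.median(ordered),
    }
    for q in quantiles:
        summary[f"q{q}"] = quantile(ordered, q)
    return summary


p_know_draws = [0.2, 0.4, 0.6, 0.8]
stats = summarise_draws_sketch(p_know_draws, quantiles=(0.0, 0.5, 1.0))
```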
- stanbkt.utils.rmse(correctness, predictions)#
Compute root mean squared error between predicted probabilities and binary labels.
- Parameters:
correctness (TypeAliasType) – Flattened binary correctness labels (0 or 1).
predictions (TypeAliasType) – Flattened predicted probabilities in [0, 1].
- Returns:
Root mean squared error.
- Return type:
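Notes
As a quick illustration of the quantity being computed, here is a pure-Python sketch; `rmse_sketch` is a hypothetical name and not the library implementation.

```python
import math


def rmse_sketch(correctness, predictions):
    """Square root of the mean squared gap between probabilities and 0/1 labels."""
    if len(correctness) != len(predictions):
        raise ValueError("correctness and predictions must have the same length")
    squared_errors = [(p - y) ** 2 for y, p in zip(correctness, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


labels = [1, 0, 1, 0]
probs = [0.5, 0.5, 0.5, 0.5]
score = rmse_sketch(labels, probs)  # every prediction is off by exactly 0.5
```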
- stanbkt.utils.setup_cmdstanpy(n_cores=2)#
Set up CmdStanPy by checking for CmdStan installation and setting the path if necessary.
- stanbkt.utils.sim_grouped_BKT(n_students=10, n_problems=20, n_kcs=1, n_groups=2, prior=0.1, learn=0.01, forget=0.05, guess=0.2, slip=0.1, rng_seed=None, kc_sequence=None, group_sequence=None, frac=1.0)#
Simulate student problem responses under grouped BKT model.
Generates a synthetic dataset by sampling problem responses from a Bayesian Knowledge Tracing model where BKT parameters can vary by student group and by knowledge component (KC).
- Parameters:
n_students (int) – Number of students to simulate.
n_problems (int) – Number of problems to simulate.
n_kcs (int) – Number of knowledge components (KCs).
n_groups (int) – Number of student groups with distinct BKT parameters.
prior (Union[float, Sequence[float], Sequence[Sequence[float]]]) – Initial knowledge probability. Accepted formats are scalar, shape (n_groups,), or shape (n_groups, n_kcs).
learn (Union[float, Sequence[float], Sequence[Sequence[float]]]) – Learning (mastery) probability. Accepted formats are scalar, shape (n_groups,), or shape (n_groups, n_kcs).
forget (Union[float, Sequence[float], Sequence[Sequence[float]]]) – Forgetting probability. Accepted formats are scalar, shape (n_groups,), or shape (n_groups, n_kcs).
guess (Union[float, Sequence[float], Sequence[Sequence[float]]]) – Guessing probability (correct response without knowledge). Accepted formats are scalar, shape (n_groups,), or shape (n_groups, n_kcs).
slip (Union[float, Sequence[float], Sequence[Sequence[float]]]) – Slipping probability (incorrect response despite knowledge). Accepted formats are scalar, shape (n_groups,), or shape (n_groups, n_kcs).
rng_seed (int or None, optional) – Random seed for reproducibility.
kc_sequence (array-like of int or None, optional) – KC assignment for each problem. If None, randomly sampled.
group_sequence (array-like of int or None, optional) – Group assignment for each student (0-indexed). If None, students are evenly distributed across groups.
frac (float, default 1.0) – Fraction of rows to include in the output dataset. This simulates missing data, or students not completing all problems, by randomly dropping rows after simulation.
- Returns:
Simulated dataset with columns: student_id, problem_id, correct, kc_id, group_id, timestamp.
- Return type:
- Raises:
ValueError – If parameter shapes are invalid, if kc_sequence is invalid, or if group_sequence is invalid.
Notes
Parameters can be specified per-group by providing lists/arrays of length n_groups, or per-(group, KC) by providing a 2D array with shape (n_groups, n_kcs). For example:
    sim_grouped_BKT(
        n_students=20,
        n_groups=2,
        n_kcs=2,
        prior=[[0.2, 0.1], [0.5, 0.4]],  # rows=groups, cols=KCs
        learn=[[0.01, 0.02], [0.05, 0.06]],
    )
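The three accepted shapes (scalar, per-group, per-(group, KC)) can all be normalised to one table with n_groups rows and n_kcs columns. The sketch below shows one way to do that; `broadcast_param` is a hypothetical helper and the error messages are illustrative, not those of the library.

```python
def broadcast_param(value, n_groups, n_kcs, name="param"):
    """Hypothetical helper: expand value to n_groups rows of n_kcs floats each."""
    if isinstance(value, (int, float)):
        # Scalar: the same value for every group and every KC.
        return [[float(value)] * n_kcs for _ in range(n_groups)]
    rows = list(value)
    if len(rows) != n_groups:
        raise ValueError(f"{name}: expected {n_groups} group entries, got {len(rows)}")
    if all(isinstance(row, (int, float)) for row in rows):
        # Shape (n_groups,): one value per group, repeated across KCs.
        return [[float(row)] * n_kcs for row in rows]
    if any(isinstance(row, (int, float)) for row in rows):
        raise ValueError(f"{name}: cannot mix scalars and sequences")
    # Shape (n_groups, n_kcs): check every row length.
    table = [[float(x) for x in row] for row in rows]
    if any(len(row) != n_kcs for row in table):
        raise ValueError(f"{name}: every group needs {n_kcs} KC entries")
    return table


prior_table = broadcast_param([0.2, 0.5], n_groups=2, n_kcs=2, name="prior")
```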
Each student is assigned a group, and knowledge states are tracked independently per (student, KC). Transition and emission probabilities are chosen from that student’s group and the active KC.
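The per-(student, KC) bookkeeping can be sketched for a single student as follows. This is a simplified illustration, not the library code: `simulate_student` and the `group_params` layout are hypothetical, and applying the learn/forget transition after each response is an assumption about the ordering.

```python
import random


def simulate_student(kc_sequence, group_params, rng):
    """Sample one student's responses; group_params[kc] holds that group's BKT parameters."""
    known = {}  # latent knowledge state, tracked separately for each KC
    responses = []
    for kc in kc_sequence:
        p = group_params[kc]
        if kc not in known:
            # First encounter with this KC: draw the initial state from the prior.
            known[kc] = rng.random() < p["prior"]
        # Emission: a knowing student is correct unless they slip,
        # an unknowing student is correct only by guessing.
        p_correct = 1.0 - p["slip"] if known[kc] else p["guess"]
        responses.append(1 if rng.random() < p_correct else 0)
        # Transition after the attempt: possibly forget if known, possibly learn if not.
        if known[kc]:
            known[kc] = rng.random() >= p["forget"]
        else:
            known[kc] = rng.random() < p["learn"]
    return responses


params = {
    0: {"prior": 0.2, "learn": 0.01, "forget": 0.05, "guess": 0.2, "slip": 0.1},
    1: {"prior": 0.1, "learn": 0.02, "forget": 0.05, "guess": 0.2, "slip": 0.1},
}
responses = simulate_student([0, 1, 0, 1, 1], params, random.Random(0))
```

Passing a seeded `random.Random` instance keeps the sketch reproducible, in the same spirit as the rng_seed parameter.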
- stanbkt.utils.sim_simple_BKT(n_students=10, n_problems=20, n_kcs=1, prior=0.1, learn=0.01, forget=0.05, guess=0.2, slip=0.1, rng_seed=None, kc_sequence=None, frac=1.0)#
Simulate student problem responses under simple BKT model.
Generates a synthetic dataset by sampling problem responses from a Bayesian Knowledge Tracing model with fixed parameters.
- Parameters:
n_students (int, default 10) – Number of students to simulate.
n_problems (int, default 20) – Number of problems to simulate.
n_kcs (int, default 1) – Number of knowledge components (KCs).
prior (Union[float, Sequence[float]]) – Initial knowledge probability. Scalar broadcast to all KCs, or array of length n_kcs.
learn (Union[float, Sequence[float]]) – Learning (mastery) probability. Scalar or array of length n_kcs.
forget (Union[float, Sequence[float]]) – Forgetting probability. Scalar or array of length n_kcs.
guess (Union[float, Sequence[float]]) – Guessing probability (correct response without knowledge). Scalar or array of length n_kcs.
slip (Union[float, Sequence[float]]) – Slipping probability (incorrect response despite knowledge). Scalar or array of length n_kcs.
rng_seed (int or None, optional) – Random seed for reproducibility.
kc_sequence (array-like of int or None, optional) – KC assignment for each problem. If None, randomly sampled.
frac (float, default 1.0) – Fraction of rows to include in the output dataset. This simulates missing data, or students not completing all problems, by randomly dropping rows after simulation.
- Returns:
Simulated dataset with columns: student_id, problem_id, correct, kc_id.
- Return type:
- Raises:
ValueError – If parameter lengths do not match n_kcs or if kc_sequence is invalid.
- stanbkt.utils.summarize_draws(draws, parameter_names, percentiles)#
Summarize posterior draws for each parameter.
Returns a DataFrame with one row per parameter and a stable set of descriptive columns aligned with the fit-level summary API.
- stanbkt.utils.summary_parameter_names(column_names)#
Return parameter names to include in fit summaries.
Mirrors CmdStan’s MCMC summary behavior by retaining lp__ and excluding method variables (which conventionally end with __).
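The filtering rule can be written in one line. The sketch below is an illustration of that rule rather than the library code; `summary_names_sketch` is a hypothetical name, and `accept_stat__` and `stepsize__` are used as typical CmdStan sampler diagnostics.

```python
def summary_names_sketch(column_names):
    """Keep lp__ and every name that does not end with a double underscore."""
    return [name for name in column_names if name == "lp__" or not name.endswith("__")]


columns = ["lp__", "accept_stat__", "stepsize__", "prior", "learn"]
kept = summary_names_sketch(columns)  # the two sampler diagnostics are dropped
```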
- stanbkt.utils.validate_data(data, col_mapping, check_groups=False, additional_required_cols=None)#
Validate input data for BKT model fitting.
- Parameters:
data (DataFrame) – Input data containing student interactions.
col_mapping (dict[str, str]) – Mapping of expected column names. Keys should be ‘student_id’, ‘problem_id’, ‘correct’, and ‘kc_id’. If None, default column names are used.
check_groups (bool) – Whether to check for group column in the data.
- Raises:
ValueError – If required columns are missing or if correctness values are not binary.
- Return type:
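Notes
The two documented failure modes (missing required columns, non-binary correctness values) can be illustrated on plain dict rows instead of a DataFrame. The sketch below is hypothetical throughout: `validate_rows`, its row format, and its error messages are illustrative choices, not the library's.

```python
STANDARD_COLUMNS = ("student_id", "problem_id", "correct", "kc_id")


def validate_rows(rows, col_mapping=None, check_groups=False):
    """Hypothetical check over a list of dict rows; raises ValueError on bad input."""
    names = {col: col for col in STANDARD_COLUMNS + ("group_id",)}
    names.update(col_mapping or {})  # standard name -> actual column name
    required = list(STANDARD_COLUMNS)
    if check_groups:
        required.append("group_id")
    for index, row in enumerate(rows):
        missing = [names[col] for col in required if names[col] not in row]
        if missing:
            raise ValueError(f"row {index} is missing required columns: {missing}")
        if row[names["correct"]] not in (0, 1):
            raise ValueError(f"row {index} has a non-binary correctness value")


rows = [
    {"student_id": "s1", "problem_id": "p1", "is_right": 1, "kc_id": "k1"},
    {"student_id": "s1", "problem_id": "p2", "is_right": 0, "kc_id": "k1"},
]
validate_rows(rows, col_mapping={"correct": "is_right"})  # passes silently
```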