poco logo


poco (POsterior COmpression, “little” in Italian) provides a small, unified API for compressing MCMC posterior draws into compact density approximations (Gaussian mixture models, kernel density estimates), regenerating samples on demand, and evaluating the density at arbitrary points.

It is designed for situations where you need to store, ship, or post-process many Bayesian models with limited disk space - for example when caching hundreds of brms or sccomp fits, or sharing pre-trained posteriors with collaborators. With Gaussian-mixture compression, the on-disk footprint can drop by 100-1000x while preserving correlation structure, multimodality and most downstream summaries.

Why?

A Stan model with 1000 parameters x 8000 draws x 8 bytes = ~64 MB per chain. With four chains that is 256 MB, and 100 such models is 25 GB.

A 10-component Gaussian mixture in 1000 dimensions stores only weights + means + covariances, typically a few hundred KB.

Installation

# install.packages("remotes")
remotes::install_github("stemangiola/poco")

brms / cmdstanr / sccomp are Suggested dependencies; you only need them if you plan to compress fits from those packages.

Quick tour

poco exposes three entry points that all share the same method argument:

Function Input
compress_posterior() a draws matrix / data.frame / posterior::draws_*
compress_fit() a cmdstanr::CmdStanMCMC or vector of CSV paths
compress_brmsfit() a brms::brmsfit (cmdstanr backend)
compress_sccomp() an sccomp fit (cmdstanr backend)

All return an S3 object with class posterior_compressed, which is consumed by:

From a draws matrix

library(poco)

set.seed(1)
draws <- matrix(rnorm(2000 * 3), ncol = 3,
                dimnames = list(NULL, c("alpha", "beta", "sigma")))

comp <- compress_posterior(
  draws,
  method = "mclust",        # or "mvdens_gmm", "mvdens_kde"
  n_components = 5
)

new_draws <- sample_posterior(comp, n_draws = 4000)
density_posterior(comp, x = matrix(c(0, 0, 1), nrow = 1))

saveRDS(comp, "model.rds", compress = "xz")

From a brms fit

library(brms)
fit <- brm(y ~ x + (1 | g), data = dat, backend = "cmdstanr")

result <- compress_brmsfit(fit, method = "mclust", n_components = 5)

saveRDS(result$compressed, "model_compressed.rds", compress = "xz")
saveRDS(result$structure,  "model_structure.rds")

# Later, reload + reconstruct + use brms machinery as normal:
result_loaded <- list(
  compressed = readRDS("model_compressed.rds"),
  structure  = readRDS("model_structure.rds")
)
fit_recon <- reconstruct_brmsfit(result_loaded)

predict(fit_recon, newdata = new_data)
posterior_predict(fit_recon)

From a cmdstanr fit (or CSV files)

library(cmdstanr)
fit <- mod$sample(data = stan_data, chains = 4)

comp <- compress_fit(fit, method = "mclust", n_components = 5)
new_draws <- sample_posterior(comp, n_draws = 4000)

Methods

method Backend Notes
"mclust" mclust::Mclust() parametric GMM, BIC-based selection, fast & compact
"mvdens_gmm" mvdens::fit.gmm() EM-based GMM, requires the suggested mvdens package
"mvdens_kde" mvdens::fit.kde() kernel density estimate, larger but non-parametric

compression_methods() returns the current set.

How much information was lost?

Like JPEG quality scores, evaluate_compression() reports a reproduction % for the compressed posterior. It uses two backend-independent two-sample diagnostics on samples regenerated from the compressed object versus the reference draws — so the score is not circular (it doesn’t reuse the same density model that did the compression):

  • Energy distance (Szekely & Rizzo) anchored against a bootstrap Monte Carlo noise envelope from the reference itself.
  • Classifier two-sample test (C2ST) — a random forest via ranger::ranger() under cross-validation tries to tell reference draws from reconstructed draws; AUC of 0.5 means indistinguishable. Pass classifier = "knn" for a k-NN alternative instead of the forest.
fidelity <- evaluate_compression(comp, reference_draws = draws)
fidelity
#> <compression_fidelity>
#>   method        : mclust
#>   parameters    : 3
#>   reference n   : 1500
#>   eval n        : 1500
#>   ----------------------------------------
#>   energy        : 100.0% reproduction
#>     distance    : 0.054   noise envelope (q90): 0.062   ratio: 0.87x
#>   C2ST          :  97.7% reproduction
#>     AUC         : 0.511   classifier: ranger   cv_folds: 5
#>   ----------------------------------------
#>   reproduction  : 98.9%

For a strict out-of-sample evaluation, hold the reference draws out before fitting:

idx  <- sample.int(nrow(draws), 0.8 * nrow(draws))
comp <- compress_posterior(draws[idx, ], method = "mclust")
evaluate_compression(comp, reference_draws = draws[-idx, ])

Vignettes

License

GPL-3