WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Flexible Imbalance-Fairness-Aware Classifiers

October 08, 2025

This note is for Deng, Z., Zhang, J., Zhang, L., Ye, T., Coley, Y., Su, W. J., & Zou, J. (2022). FIFA: Making Fairness More Generalizable in Classifiers Trained on Imbalanced Data (No. arXiv:2206.02792). arXiv.

Continue reading



Fair Risk Control for Calibrating Multi-group Fairness Risks

October 06, 2025

This note is for Zhang, L., Roth, A., & Zhang, L. (2024). Fair Risk Control: A Generalized Framework for Calibrating Multi-group Fairness Risks. Proceedings of the 41st International Conference on Machine Learning, 59783–59805.

Continue reading



Robust Detection of Watermarks for LLMs under Human Edits

October 05, 2025

This note is for Li, X., Ruan, F., Wang, H., Long, Q., & Su, W. J. (2025). Robust detection of watermarks for large language models under human edits. Journal of the Royal Statistical Society Series B.

Continue reading



Global and Local Correlations under Spatial Autocorrelation

October 04, 2025

This note is for Viladomat, J., Mazumder, R., McInturff, A., McCauley, D. J., & Hastie, T. (2014). Assessing the significance of global and local correlations under spatial autocorrelation: A nonparametric approach. Biometrics, 70(2), 409–418.

Continue reading



Generalized Estimating Equations for Differentially Expressed Genes in Spatial Transcriptomics

October 03, 2025

This note is for Wang, Y., Zang, C., Li, Z., Guo, C. C., Lai, D., & Wei, P. (2025). A comparative study of statistical methods for identifying differentially expressed genes in spatial transcriptomics (p. 2025.02.17.638726). bioRxiv.

  • Seurat, by far the most popular tool for analyzing ST data, uses the Wilcoxon rank-sum test by default for differential expression analysis
  • the paper proposes a Generalized Score Test (GST) in the Generalized Estimating Equations (GEEs) framework as a robust solution for differential gene expression analysis in ST.

Introduction

a common task in ST data analysis involves identifying DE genes across pathological grades

this study considered two potential approaches: generalized linear mixed model (GLMM) and generalized estimating equations (GEE)

2.2 Generalized Linear Mixed Model (GLMM)

Let $Y_{ij}$ represent the gene expression count for gene $j$ at spatial location $i$.

  • $X_i$: binary dummy variable for pathology grade at location $i$
  • spatial coordinates $s_i = (s_{i1}, s_{i2})$ for spatial location $i$

the model is specified as

\[\log \mu_{ij} = X_i^T\beta + \varepsilon_{ij}\]

where

  • $\mu_{ij}$ is the expected count for gene $j$ at location $i$
  • $\beta = [\beta_{j0}, \beta_{j1}]^T$ is the vector of fixed effect coefficients, where $\beta_{j0}$ is the intercept and $\beta_{j1}$ is the effect of Grade B compared to Grade A
  • the random effect $\varepsilon_{ij}$ is assumed to follow a normal distribution, $\varepsilon_{ij}\sim N(0, V(\sigma_j^2, \kappa_j, \tau))$, representing the random effect for gene $j$ at the $i$-th spatial location

for a given gene $j$, the spatial covariance matrix $V(\sigma_j^2, \kappa_j, \tau)$ is defined based on the distances between pairs of spatial locations

the $(i, i')$-th element of $V(\sigma_j^2, \kappa_j, \tau)$ is given by

\[V_{ii'}(\sigma_j^2, \kappa_j, \tau) = \sigma_j^2R(\tau_{ii'}, \kappa_j) = \sigma_j^2 \exp(-\tau_{ii'}/\kappa_j)\]
  • $\tau_{ii'} = \Vert s_i - s_{i'}\Vert$ denotes the Euclidean distance between two spatial locations $i$ and $i'$
  • $\kappa_j > 0$ is a parameter that determines the rate of decay in correlation with distance, with larger values of $\kappa_j$ indicating stronger correlations and smaller semi-variances
  • the exponential spatial structure here is a specific case of the Matérn correlation structure $R(\tau)$
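as a quick illustration (my own sketch, not code from the paper), this covariance matrix can be assembled directly from the pairwise distances, with `sigma2` and `kappa` standing in for $\sigma_j^2$ and $\kappa_j$:

```python
# minimal sketch: build V_{ii'} = sigma^2 * exp(-||s_i - s_{i'}|| / kappa)
import numpy as np
from scipy.spatial.distance import cdist

def exp_spatial_cov(coords, sigma2, kappa):
    """coords: (n, 2) array of locations s_i; returns the n x n covariance matrix"""
    tau = cdist(coords, coords)           # pairwise Euclidean distances tau_{ii'}
    return sigma2 * np.exp(-tau / kappa)  # larger kappa -> slower decay, stronger correlation

coords = np.random.default_rng(0).uniform(size=(200, 2))  # toy spatial locations
V = exp_spatial_cov(coords, sigma2=1.0, kappa=0.5)
```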

to test for DE genes across the two pathology grades, test

\[H_0: \beta_{j1} = 0\qquad H_a: \beta_{j1} \neq 0\]

2.3 Generalized Estimating Equations (GEE)

instead of explicitly modeling the spatial correlation structure using random effects, GEE uses a “working” correlation matrix to account for the spatial dependence between observations

the paper adopted the GEE model with an independence working correlation structure, dividing the whole ST tissue into $m$ spatial clusters using $K$-means clustering.

the mean of $Y_{ij}$, denoted by $\mu_{ij}$, is linked to the covariates through a log link function, and the model is specified as

\[\log\mu_{ij} = X_i^T\beta\]

where $\beta = [\beta_{j0}, \beta_{j1}]^T$ is the vector of fixed effect coefficients.

The parameters $\beta$ are estimated by solving the GEE:

\[\sum_{i=1}^m D_i^TV_i^{-1}(Y_i - \mu_i) = 0\]

where

  • $D_i$ is the derivative of the mean response with respect to $\beta$ in the cluster $i$: $D_i = \frac{\partial \mu_i}{\partial \beta}$
  • $V_i$ is the variance-covariance matrix of responses in the cluster $i$, which is a function of the working correlation matrix $R_i$ ($V_i = C_i^{1/2}R_iC_i^{1/2}$)
  • $C_i$ is the diagonal matrix that includes the variances of the individual observations within the cluster $i$

the robust variance estimate for the estimated coefficients is calculated by the sandwich estimator

\[\widehat{Var}(\hat\beta) = \left(\sum_{i=1}^m D_i^TV_i^{-1}D_i \right)^{-1}\left(\sum_{i=1}^m D_i^TV_i^{-1}(Y_i - \mu_i)(Y_i - \mu_i)^TV_i^{-1}D_i\right)\left(\sum_{i=1}^m D_i^TV_i^{-1}D_i\right)^{-1}\]
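here is a hedged sketch of this fit in Python via statsmodels (GEE with a Poisson family, i.e. log link, and an independence working correlation); `y` (counts for one gene), `grade` (the 0/1 pathology indicator), and `coords` are assumed inputs, and the number of clusters is illustrative:

```python
# sketch: K-means spatial clusters + Poisson GEE with an independence working
# correlation; statsmodels reports robust (sandwich) standard errors by default
import numpy as np
import statsmodels.api as sm
from sklearn.cluster import KMeans

clusters = KMeans(n_clusters=100, n_init=10, random_state=0).fit_predict(coords)
X = sm.add_constant(grade)  # columns: intercept beta_{j0}, grade effect beta_{j1}
res = sm.GEE(y, X, groups=clusters,
             family=sm.families.Poisson(),
             cov_struct=sm.cov_struct.Independence()).fit()
print(res.summary())
```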

a key advantage of the GEE framework is that it produces consistent and asymptotically normal estimates of the parameters even if the working correlation structure is incorrectly specified

2.3.1 Robust Wald Test

the robust Wald test statistic $W$ is computed as

\[W = \frac{\hat\beta_{j1}}{\widehat{se}(\hat\beta_{j1})}\]
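continuing the sketch above (`res` is the fitted GEE result), the statistic and its two-sided normal p-value follow directly:

```python
# robust Wald test of H0: beta_{j1} = 0, using the sandwich standard error
from scipy import stats

W = res.params[1] / res.bse[1]      # \hat beta_{j1} divided by its robust se
p_wald = 2 * stats.norm.sf(abs(W))  # two-sided p-value from N(0, 1)
```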

2.3.2 Generalized Score Test (GST)

The asymptotic equivalence between the GST statistic and the robust Wald statistic holds only when the number of spatial clusters is large.

the score function $U(\hat\beta_0)$ is the derivative of the quasi-likelihood with respect to the parameter $\beta$, evaluated at the estimate $\hat\beta_0$ under the null:

\[U(\hat\beta_0) = \sum_{i=1}^m D_i^TV_i^{-1}(Y_i - \mu_i)\]

The GST statistic $S$ is computed from the score for $\beta_{j1}$ evaluated at the null estimate:

\[S = \frac{U(\hat\beta_0)}{\widehat{se}(U(\hat\beta_0))}\]
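statsmodels ships a score test for nested GEE models (`compare_score_test`), which can stand in for the GST here; a sketch continuing the snippets above (I have not checked that it matches the paper's exact implementation):

```python
# score test of the grade effect: fit the intercept-only null model with the
# same clusters and working correlation, then score-test it against the full model
X0 = np.ones((len(y), 1))
null_res = sm.GEE(y, X0, groups=clusters,
                  family=sm.families.Poisson(),
                  cov_struct=sm.cov_struct.Independence()).fit()
gst = res.compare_score_test(null_res)  # dict with "statistic", "p-value", "df"
print(gst)
```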

3. Results

3.1 Simulation Studies

  1. fit the GLMM to a breast cancer ST dataset from 10x Genomics. Two key spatial parameters:
    • $\sigma^2$: spatial variance that captures variability in gene expression across different spatial locations
    • $\kappa$: spatial correlation that controls the rate of decay in correlation with distance between spatial locations
  2. use the estimated parameters to generate simulated data
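a sketch of step 2, assuming a Poisson response given the spatial random effect (the excerpt does not pin down the conditional distribution) and reusing `exp_spatial_cov` from the earlier snippet; the $\beta$ and spatial parameter values are illustrative:

```python
# simulate counts from the GLMM: eps ~ N(0, V(sigma^2, kappa)), then
# Y_i ~ Poisson(exp(beta0 + beta1 * X_i + eps_i))
rng = np.random.default_rng(1)
V = exp_spatial_cov(coords, sigma2=0.8, kappa=0.3)       # plug in the fitted values
eps = rng.multivariate_normal(np.zeros(len(coords)), V)  # spatial random effect
mu = np.exp(0.5 + 1.0 * grade + eps)                     # log mu = X^T beta + eps
y_sim = rng.poisson(mu)                                  # simulated counts for one gene
```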

3.2 Real Data

3.2.1 Breast Cancer ST dataset

3798 spots and 24923 genes, retaining the 3000 most variable genes

3.2.2 Prostate Cancer ST dataset

4371 spots and 16907 genes, retaining the 3000 most variable genes

4. Discussion

Continue reading



Debiasing Watermarks via Maximal Coupling

October 03, 2025

This note is for Xie, Y., Li, X., Mallick, T., Su, W., & Zhang, R. (2025). Debiasing Watermarks for Large Language Models via Maximal Coupling. Journal of the American Statistical Association, 0(0), 1–11.

Continue reading



Post-selection inference via algorithmic stability

October 02, 2025

This note is for the paper Zrnic, T., & Jordan, M. I. (2023). Post-selection inference via algorithmic stability. The Annals of Statistics, 51(4), 1666–1691.

Continue reading



Scaling Laws for Surrogate Data

September 04, 2025

This note is for Jain, A., Montanari, A., & Sasoglu, E. (2024, November 6). Scaling laws for learning with real and surrogate data. The Thirty-eighth Annual Conference on Neural Information Processing Systems.

Continue reading



Generative Models via Transfer Learning

July 22, 2025

This note is for Tian, X., & Shen, X. (2025). Enhancing Accuracy in Generative Models via Knowledge Transfer (No. arXiv:2405.16837). arXiv.

Continue reading



Synthetic Instrument for Sparse Causation

June 12, 2025

This note is for Tang, D., Kong, D., & Wang, L. (2024). The synthetic instrument: From sparse association to sparse causation (No. arXiv:2304.01098). arXiv.

Continue reading



Identify Multiple Treatments under Unmeasured Confounding

June 02, 2025

This note is for Miao, W., Hu, W., Ogburn, E. L., & Zhou, X.-H. (2023). Identifying Effects of Multiple Treatments in the Presence of Unmeasured Confounding. Journal of the American Statistical Association, 118(543), 1953–1967.

Continue reading



Calibrating Regression Uncertainty via σ Scaling

May 01, 2025

This note is for Laves, M.-H., Ihler, S., Fast, J. F., Kahrs, L. A., & Ortmaier, T. (2020). Well-Calibrated Regression Uncertainty in Medical Imaging with Deep Learning. Proceedings of the Third Conference on Medical Imaging with Deep Learning, 393–412.

Continue reading



Direct Epistemic Uncertainty Prediction

April 25, 2025

This is the note for Lahlou, S., Jain, M., Nekoei, H., Butoi, V. I., Bertin, P., Rector-Brooks, J., Korablyov, M., & Bengio, Y. (2023). DEUP: Direct Epistemic Uncertainty Prediction (No. arXiv:2102.08501). arXiv. https://doi.org/10.48550/arXiv.2102.08501

Continue reading



Likelihood Annealing

April 23, 2025

This note is for Upadhyay, U., Kim, J. M., Schmidt, C., Schölkopf, B., & Akata, Z. (2023). Likelihood Annealing: Fast Calibrated Uncertainty for Regression (No. arXiv:2302.11012). arXiv. https://doi.org/10.48550/arXiv.2302.11012

Continue reading



LLM with Conformal Inference

April 23, 2025

This note is for Cherian, J., Gibbs, I., & Candès, E. (2024, November 6). Large language model validity via enhanced conformal prediction methods. The Thirty-eighth Annual Conference on Neural Information Processing Systems.

Continue reading



AI Models Collapse

April 21, 2025

This note is for Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson, R., & Gal, Y. (2024). AI models collapse when trained on recursively generated data. Nature, 631(8022), 755–759. https://doi.org/10.1038/s41586-024-07566-y

Continue reading



Cauchy Combination Test

April 09, 2025

This note is for Liu, Y., & Xie, J. (2020). Cauchy Combination Test: A Powerful Test With Analytic p-Value Calculation Under Arbitrary Dependency Structures. Journal of the American Statistical Association, 115(529), 393–402.

Continue reading



TabDDPM: Tabular Data with Diffusion Models

April 05, 2025

This note is for Kotelnikov, A., Baranchuk, D., Rubachev, I., & Babenko, A. (2023). TabDDPM: Modelling Tabular Data with Diffusion Models. Proceedings of the 40th International Conference on Machine Learning, 17564–17579.

Continue reading



PolyLoss

February 23, 2025

This note is for Leng, Z., Tan, M., Liu, C., Cubuk, E. D., Shi, X., Cheng, S., & Anguelov, D. (2022). PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions (No. arXiv:2204.12511). arXiv. https://doi.org/10.48550/arXiv.2204.12511

Continue reading



Entropy Regularization

February 23, 2025

This note is for Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., & Hinton, G. (2017). Regularizing Neural Networks by Penalizing Confident Output Distributions (No. arXiv:1701.06548). arXiv. https://doi.org/10.48550/arXiv.1701.06548

Continue reading



Label Smoothing

February 22, 2025

This note is for Müller, R., Kornblith, S., & Hinton, G. E. (2019). When does label smoothing help? Advances in Neural Information Processing Systems, 32.

Continue reading



Focal Loss

February 22, 2025

This post is for Mukhoti, J., Kulharia, V., Sanyal, A., Golodetz, S., Torr, P., & Dokania, P. (2020). Calibrating Deep Neural Networks using Focal Loss. Advances in Neural Information Processing Systems, 33, 15288–15299.

Continue reading



Finite Mixture Models

February 11, 2025

This note is for Wang, H., Ibrahim, S., & Mazumder, R. (2023). Nonparametric Finite Mixture Models with Possible Shape Constraints: A Cubic Newton Approach (No. arXiv:2107.08535). arXiv. https://doi.org/10.48550/arXiv.2107.08535

Continue reading



Simple Test-Time Scaling

February 07, 2025

This note is for Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., & Hashimoto, T. (2025). s1: Simple test-time scaling (No. arXiv:2501.19393). arXiv. https://doi.org/10.48550/arXiv.2501.19393

Continue reading



Approximations for KL between GMMs

February 07, 2025

This note is based on Hershey, J. R., & Olsen, P. A. (2007). Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models. 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP ’07, 4, IV-317-IV–320. https://doi.org/10.1109/ICASSP.2007.366913

Continue reading



Clustering of High-dim GMMs with EM

February 06, 2025

This post is for Cai, T. T., Ma, J., & Zhang, L. (2019). CHIME: Clustering of high-dimensional Gaussian mixtures with EM algorithm and its optimality. The Annals of Statistics, 47(3), 1234–1267. https://doi.org/10.1214/18-AOS1711

Continue reading



Statistical Review on Variational Inference

February 06, 2025

This note is for Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518), 859–877. https://doi.org/10.1080/01621459.2017.1285773

Continue reading



Over-specified EM for GMMs

February 04, 2025

This post is for Dwivedi, R., Ho, N., Khamaru, K., Wainwright, M. J., Jordan, M. I., & Yu, B. (2020). Singularity, Misspecification and the Convergence Rate of Em. The Annals of Statistics, 48(6), 3161–3182.

Continue reading



kmeans++ for Careful Seeding

January 28, 2025

This note is for Arthur, D., & Vassilvitskii, S. (2006). k-means++: The Advantages of Careful Seeding. Stanford, 11.

Continue reading



Sample Difficulty from Pre-trained Models

January 20, 2025

This note is for Cui, P., Zhang, D., Deng, Z., Dong, Y., & Zhu, J. (2023). Learning Sample Difficulty from Pre-trained Models for Reliable Prediction. Advances in Neural Information Processing Systems, 36, 25390–25408.

Continue reading



Concrete Distribution: Relaxation of Discrete Random Variables

January 17, 2025

This note is for Maddison, C. J., Mnih, A., & Teh, Y. W. (2017). The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables (No. arXiv:1611.00712). arXiv. http://arxiv.org/abs/1611.00712

Continue reading



PromptBench: Evaluation of Large Language Models

January 17, 2025

This note is for Zhu, K., Zhao, Q., Chen, H., Wang, J., & Xie, X. (2024). PromptBench: A Unified Library for Evaluation of Large Language Models.

Continue reading



FDR Control for Byzantine Machines

December 13, 2024

This is the note for Qian, C., Wang, M., Ren, H., & Zou, C. (2024). ByMI: Byzantine Machine Identification with False Discovery Rate Control. Proceedings of the 41st International Conference on Machine Learning, 41357–41382. https://proceedings.mlr.press/v235/qian24b.html

Continue reading



FDR Control under General Dependence via Symmetrization

December 13, 2024

This note is for Du, L., Guo, X., Sun, W., & Zou, C. (2023). False Discovery Rate Control Under General Dependence By Symmetrized Data Aggregation. Journal of the American Statistical Association, 118(541), 607–621. https://doi.org/10.1080/01621459.2021.1945459

Continue reading



Personalized Federated Learning with Robust and Sparse Regressions

December 10, 2024

This note is for Liu, W., Mao, X., Zhang, X., & Zhang, X. (2024). Robust Personalized Federated Learning with Sparse Penalization. Journal of the American Statistical Association, 0(0), 1–12. https://doi.org/10.1080/01621459.2024.2321652

Continue reading



Biomarker Variability in Joint Model

December 10, 2024

This note is for Wang, C., Shen, J., Charalambous, C., & Pan, J. (2024). Modeling biomarker variability in joint analysis of longitudinal and time-to-event data. Biostatistics, 25(2), 577–596. https://doi.org/10.1093/biostatistics/kxad009 and Wang, C., Shen, J., Charalambous, C., & Pan, J. (2024). Weighted biomarker variability in joint analysis of longitudinal and time-to-event data. The Annals of Applied Statistics, 18(3), 2576–2595. https://doi.org/10.1214/24-AOAS1896

Continue reading



Derandomised Knockoffs from E-values

December 09, 2024

This note is for Ren, Z., & Barber, R. F. (2024). Derandomised knockoffs: Leveraging e-values for false discovery rate control. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(1), 122–154. https://doi.org/10.1093/jrsssb/qkad085

Continue reading



FDR Control with E-values

December 05, 2024

This note is for Wang, R., & Ramdas, A. (2022). False Discovery Rate Control with E-values. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(3), 822–852. https://doi.org/10.1111/rssb.12489 and Aaditya’s talk at ISSI on October 25, 2023

Continue reading



Boosting Data Analytics with Synthetic Data

December 03, 2024

This post is for Shen, X., Liu, Y., & Shen, R. (2024). Boosting Data Analytics With Synthetic Volume Expansion (No. arXiv:2310.17848). arXiv. https://doi.org/10.48550/arXiv.2310.17848

Continue reading



Task-Agnostic Machine-Learning-Assisted Inference

November 22, 2024

This note is for Miao, J., & Lu, Q. (2024). Task-Agnostic Machine-Learning-Assisted Inference (No. arXiv:2405.20039). arXiv. https://doi.org/10.48550/arXiv.2405.20039

Continue reading



Review on Normalizing Flows

November 22, 2024

This note is for Kobyzev, I., Prince, S. J. D., & Brubaker, M. A. (2020). Normalizing Flows: An Introduction and Review of Current Methods (No. arXiv:1908.09257). arXiv. http://arxiv.org/abs/1908.09257

Continue reading



C-SIDE for Cell-type-specific Spatial DE

November 12, 2024

This note is for Cable, D. M., Murray, E., Shanmugam, V., Zhang, S., Zou, L. S., Diao, M., Chen, H., Macosko, E. Z., Irizarry, R. A., & Chen, F. (2022). Cell type-specific inference of differential expression in spatial transcriptomics. Nature Methods, 19(9), 1076–1087. https://doi.org/10.1038/s41592-022-01575-3

Continue reading



spaCRT: saddlepoint approximation-based conditional randomization test

November 04, 2024

This note is for Niu, Z., Choudhury, J. R., & Katsevich, E. (2024). Computationally efficient and statistically accurate conditional independence testing with spaCRT (No. arXiv:2407.08911; Version 1). arXiv. https://doi.org/10.48550/arXiv.2407.08911

Continue reading



Benchopt: Benchmarks for ML Optimizations

November 01, 2024

This is the note for Moreau, T., Massias, M., Gramfort, A., Ablin, P., Bannier, P.-A., Charlier, B., Dagréou, M., Tour, T. D. la, Durif, G., Dantas, C. F., Klopfenstein, Q., Larsson, J., Lai, E., Lefort, T., Malézieux, B., Moufad, B., Nguyen, B. T., Rakotomamonjy, A., Ramzi, Z., … Vaiter, S. (2022). Benchopt: Reproducible, efficient and collaborative optimization benchmarks (No. arXiv:2206.13424). arXiv. https://doi.org/10.48550/arXiv.2206.13424

Continue reading



XBART: Accelerated Bayesian Additive Regression Trees

October 04, 2024

This post is based on He, J., Yalov, S., & Hahn, P. R. (2019). XBART: Accelerated Bayesian Additive Regression Trees. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, 1130–1138. https://proceedings.mlr.press/v89/he19a.html and He, J., & Hahn, P. R. (2023). Stochastic Tree Ensembles for Regularized Nonlinear Regression. Journal of the American Statistical Association, 118(541), 551–570. https://doi.org/10.1080/01621459.2021.1942012

Continue reading



scDRS: single-cell disease relevance score

September 10, 2024

This note is for Zhang, M. J., Hou, K., Dey, K. K., Sakaue, S., Jagadeesh, K. A., Weinand, K., Taychameekiatchai, A., Rao, P., Pisco, A. O., Zou, J., Wang, B., Gandal, M., Raychaudhuri, S., Pasaniuc, B., & Price, A. L. (2022). Polygenic enrichment distinguishes disease associations of individual cells in single-cell RNA-seq data. Nature Genetics, 54(10), 1572–1580. https://doi.org/10.1038/s41588-022-01167-z

Continue reading



Guarantees of Lloyd’s Algorithm

September 10, 2024

This note is for Lu, Y., & Zhou, H. H. (2016). Statistical and Computational Guarantees of Lloyd’s Algorithm and its Variants (No. arXiv:1612.02099). arXiv. http://arxiv.org/abs/1612.02099

Continue reading



Data Thinning for Convolution-Closed Distributions

August 29, 2024

This note is for Neufeld, A., Dharamshi, A., Gao, L. L., & Witten, D. (2024). Data Thinning for Convolution-Closed Distributions. Journal of Machine Learning Research, 25(57), 1–35.

Continue reading



Data Fission

August 05, 2024

This note is for the discussion paper Leiner, J., Duan, B., Wasserman, L., & Ramdas, A. (2023). Data fission: Splitting a single data point (arXiv:2112.11079). arXiv. http://arxiv.org/abs/2112.11079 in the JASA invited session at JSM 2024

Continue reading



Watermarks in Large Language Models

August 05, 2024

This is the note for the talk Statistical Inference in Large Language Models: A Statistical Framework of Watermarks given by Weijie Su at JSM 2024

Continue reading



See all posts →