WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Flexible Imbalance-Fairness-Aware Classifiers

October 08, 2025

This note is for Deng, Z., Zhang, J., Zhang, L., Ye, T., Coley, Y., Su, W. J., & Zou, J. (2022). FIFA: Making Fairness More Generalizable in Classifiers Trained on Imbalanced Data (No. arXiv:2206.02792). arXiv.

Continue reading



Fair Risk Control for Calibrating Multi-group Fairness Risks

October 06, 2025

This note is for Zhang, L., Roth, A., & Zhang, L. (2024). Fair Risk Control: A Generalized Framework for Calibrating Multi-group Fairness Risks. Proceedings of the 41st International Conference on Machine Learning, 59783–59805.

Continue reading



Robust Detection of Watermarks for LLMs under Human Edits

October 05, 2025

This note is for Li, X., Ruan, F., Wang, H., Long, Q., & Su, W. J. (2025). Robust detection of watermarks for large language models under human edits. Journal of the Royal Statistical Society Series B.

Continue reading



Global and Local Correlations under Spatial Autocorrelation

October 04, 2025

This note is for Viladomat, J., Mazumder, R., McInturff, A., McCauley, D. J., & Hastie, T. (2014). Assessing the significance of global and local correlations under spatial autocorrelation: A nonparametric approach. Biometrics, 70(2), 409–418.

Continue reading



Generalized Estimating Equations for Differentially Expressed Genes in Spatial Transcriptomics

October 03, 2025

This note is for Wang, Y., Zang, C., Li, Z., Guo, C. C., Lai, D., & Wei, P. (2025). A comparative study of statistical methods for identifying differentially expressed genes in spatial transcriptomics (p. 2025.02.17.638726). bioRxiv.

  • Seurat, by far the most popular tool for analyzing ST data, uses the Wilcoxon rank-sum test by default for differential expression analysis
  • the paper proposes a Generalized Score Test (GST) in the Generalized Estimating Equations (GEEs) framework as a robust solution for differential gene expression analysis in ST.

Introduction

a common task in ST data analysis involves identifying DE genes across pathological grades

this study considered two potential approaches: generalized linear mixed model (GLMM) and generalized estimating equations (GEE)

2.2 Generalized Linear Mixed Model (GLMM)

Let $Y_{ij}$ represent the gene expression count for gene $j$ at spatial location $i$.

  • $X_i$: binary dummy variable for pathology grade at location $i$
  • spatial coordinates $s_i = (s_{i1}, s_{i2})$ for spatial location $i$

the model is specified as

\[\log \mu_{ij} = X_i^T\beta + \varepsilon_{ij}\]

where

  • $\mu_{ij}$ is the expected count for gene $j$ at location $i$
  • $\beta = [\beta_{j0}, \beta_{j1}]^T$ is the vector of fixed effect coefficients, where $\beta_{j0}$ is the intercept and $\beta_{j1}$ is the effect of Grade B compared to Grade A
  • the random effect $\varepsilon_{ij}$ is assumed to follow a normal distribution, $\varepsilon_{ij}\sim N(0, V(\sigma_j^2, \kappa_j, \tau))$, representing the random effect for gene $j$ at the $i$-th spatial location

for a given gene $j$, the spatial covariance matrix $V(\sigma_j^2, \kappa_j, \tau)$ is defined based on the distances between pairs of spatial locations

the $(i, i')$-th element of $V(\sigma_j^2, \kappa_j, \tau)$ is given by

\[V_{ii'}(\sigma_j^2, \kappa_j, \tau) = \sigma_j^2R(\tau_{ii'}, \kappa_j) = \sigma_j^2 \exp(-\tau_{ii'}/\kappa_j)\]
  • $\tau_{ii'} = \Vert s_i - s_{i'}\Vert$ denotes the Euclidean distance between two spatial locations $i$ and $i'$
  • $\kappa_j > 0$ is a parameter that determines the rate of decay in correlation with distance, with larger values of $\kappa_j$ indicating stronger correlations and smaller semi-variances
  • the exponential spatial structure here is a specific case of the Matérn correlation structure $R(\tau)$
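as a quick illustration (my own sketch, not code from the paper), this covariance matrix can be assembled directly from the pairwise distances, with `sigma2` and `kappa` standing in for $\sigma_j^2$ and $\kappa_j$:

```python
# minimal sketch: build V_{ii'} = sigma^2 * exp(-||s_i - s_{i'}|| / kappa)
import numpy as np
from scipy.spatial.distance import cdist

def exp_spatial_cov(coords, sigma2, kappa):
    """coords: (n, 2) array of locations s_i; returns the n x n covariance matrix"""
    tau = cdist(coords, coords)           # pairwise Euclidean distances tau_{ii'}
    return sigma2 * np.exp(-tau / kappa)  # larger kappa -> slower decay, stronger correlation

coords = np.random.default_rng(0).uniform(size=(200, 2))  # toy spatial locations
V = exp_spatial_cov(coords, sigma2=1.0, kappa=0.5)
```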

to test for DE genes across the two pathology grades, test

\[H_0: \beta_{j1} = 0\qquad H_a: \beta_{j1} \neq 0\]

2.3 Generalized Estimating Equations (GEE)

instead of explicitly modeling the spatial correlation structure using random effects, GEE uses a “working” correlation matrix to account for the spatial dependence between observations

the paper adopted the GEE model with an independence working correlation structure, dividing the whole ST tissue into $m$ spatial clusters using $K$-means clustering.

the mean of $Y_{ij}$, denoted by $\mu_{ij}$, is linked to the covariates through a log link function, and the model is specified as

\[\log\mu_{ij} = X_i^T\beta\]

where $\beta = [\beta_{j0}, \beta_{j1}]^T$ is the vector of fixed effect coefficients.

The parameters $\beta$ are estimated by solving the GEE:

\[\sum_{i=1}^m D_i^TV_i^{-1}(Y_i - \mu_i) = 0\]

where

  • $D_i$ is the derivative of the mean response with respect to $\beta$ in the cluster $i$: $D_i = \frac{\partial \mu_i}{\partial \beta}$
  • $V_i$ is the variance-covariance matrix of responses in the cluster $i$, which is a function of the working correlation matrix $R_i$ ($V_i = C_i^{1/2}R_iC_i^{1/2}$)
  • $C_i$ is the diagonal matrix that includes the variances of the individual observations within the cluster $i$

the robust variance estimate for the estimated coefficients is calculated by the sandwich estimator

\[\widehat{Var}(\hat\beta) = \left(\sum_{i=1}^m D_i^TV_i^{-1}D_i \right)^{-1}\left(\sum_{i=1}^m D_i^TV_i^{-1}(Y_i - \mu_i)(Y_i - \mu_i)^TV_i^{-1}D_i\right)\left(\sum_{i=1}^m D_i^TV_i^{-1}D_i\right)^{-1}\]
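here is a hedged sketch of this fit in Python via statsmodels (GEE with a Poisson family, i.e. log link, and an independence working correlation); `y` (counts for one gene), `grade` (the 0/1 pathology indicator), and `coords` are assumed inputs, and the number of clusters is illustrative:

```python
# sketch: K-means spatial clusters + Poisson GEE with an independence working
# correlation; statsmodels reports robust (sandwich) standard errors by default
import numpy as np
import statsmodels.api as sm
from sklearn.cluster import KMeans

clusters = KMeans(n_clusters=100, n_init=10, random_state=0).fit_predict(coords)
X = sm.add_constant(grade)  # columns: intercept beta_{j0}, grade effect beta_{j1}
res = sm.GEE(y, X, groups=clusters,
             family=sm.families.Poisson(),
             cov_struct=sm.cov_struct.Independence()).fit()
print(res.summary())
```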

a key advantage of the GEE framework is that it produces consistent and asymptotically normal estimates of the parameters even if the working correlation structure is incorrectly specified

2.3.1 Robust Wald Test

the robust Wald test statistic $W$ is computed as

\[W = \frac{\hat\beta_{j1}}{\widehat{se}(\hat\beta_{j1})}\]
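continuing the sketch above (`res` is the fitted GEE result), the statistic and its two-sided normal p-value follow directly:

```python
# robust Wald test of H0: beta_{j1} = 0, using the sandwich standard error
from scipy import stats

W = res.params[1] / res.bse[1]      # \hat beta_{j1} divided by its robust se
p_wald = 2 * stats.norm.sf(abs(W))  # two-sided p-value from N(0, 1)
```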

2.3.2 Generalized Score Test (GST)

The asymptotic equivalence between the GST statistic and the robust Wald statistic holds only when the number of spatial clusters is large.

the score function $U(\hat\beta_0)$ is the derivative of the quasi-likelihood with respect to the parameter $\beta$, evaluated at the estimate $\hat\beta_0$ under the null:

\[U(\hat\beta_0) = \sum_{i=1}^m D_i^TV_i^{-1}(Y_i - \mu_i)\]

The GST statistic $S$ is computed from the score for $\beta_{j1}$ evaluated at the null estimate:

\[S = \frac{U(\hat\beta_0)}{\widehat{se}(U(\hat\beta_0))}\]
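statsmodels ships a score test for nested GEE models (`compare_score_test`), which can stand in for the GST here; a sketch continuing the snippets above (I have not checked that it matches the paper's exact implementation):

```python
# score test of the grade effect: fit the intercept-only null model with the
# same clusters and working correlation, then score-test it against the full model
X0 = np.ones((len(y), 1))
null_res = sm.GEE(y, X0, groups=clusters,
                  family=sm.families.Poisson(),
                  cov_struct=sm.cov_struct.Independence()).fit()
gst = res.compare_score_test(null_res)  # dict with "statistic", "p-value", "df"
print(gst)
```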

3. Results

3.1 Simulation Studies

  1. fit the GLMM to a breast cancer ST dataset from 10x Genomics. Two key spatial parameters:
    • $\sigma^2$: spatial variance that captures variability in gene expression across different spatial locations
    • $\kappa$: spatial correlation that controls the rate of decay in correlation with distance between spatial locations
  2. use the estimated parameters to generate simulated data
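a sketch of step 2, assuming a Poisson response given the spatial random effect (the excerpt does not pin down the conditional distribution) and reusing `exp_spatial_cov` from the earlier snippet; the $\beta$ and spatial parameter values are illustrative:

```python
# simulate counts from the GLMM: eps ~ N(0, V(sigma^2, kappa)), then
# Y_i ~ Poisson(exp(beta0 + beta1 * X_i + eps_i))
rng = np.random.default_rng(1)
V = exp_spatial_cov(coords, sigma2=0.8, kappa=0.3)       # plug in the fitted values
eps = rng.multivariate_normal(np.zeros(len(coords)), V)  # spatial random effect
mu = np.exp(0.5 + 1.0 * grade + eps)                     # log mu = X^T beta + eps
y_sim = rng.poisson(mu)                                  # simulated counts for one gene
```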

3.2 Real Data

3.2.1 Breast Cancer ST dataset

3798 spots and 24923 genes, retaining the 3000 most variable genes

3.2.2 Prostate Cancer ST dataset

4371 spots and 16907 genes, retaining the 3000 most variable genes

4. Discussion

Continue reading



Debiasing Watermarks via Maximal Coupling

October 03, 2025

This note is for Xie, Y., Li, X., Mallick, T., Su, W., & Zhang, R. (2025). Debiasing Watermarks for Large Language Models via Maximal Coupling. Journal of the American Statistical Association, 0(0), 1–11.

Continue reading



Post-selection inference via algorithmic stability

October 02, 2025

This note is for the paper Zrnic, T., & Jordan, M. I. (2023). Post-selection inference via algorithmic stability. The Annals of Statistics, 51(4), 1666–1691.

Continue reading



Scaling Laws for Surrogate Data

September 04, 2025

This note is for Jain, A., Montanari, A., & Sasoglu, E. (2024, November 6). Scaling laws for learning with real and surrogate data. The Thirty-eighth Annual Conference on Neural Information Processing Systems.

Continue reading



Generative Models via Transfer Learning

July 22, 2025

This note is for Tian, X., & Shen, X. (2025). Enhancing Accuracy in Generative Models via Knowledge Transfer (No. arXiv:2405.16837). arXiv.

Continue reading



Synthetic Instrument for Sparse Causation

June 12, 2025

This note is for Tang, D., Kong, D., & Wang, L. (2024). The synthetic instrument: From sparse association to sparse causation (No. arXiv:2304.01098). arXiv.

Continue reading



Identify Multiple Treatments under Unmeasured Confounding

June 02, 2025

This note is for Miao, W., Hu, W., Ogburn, E. L., & Zhou, X.-H. (2023). Identifying Effects of Multiple Treatments in the Presence of Unmeasured Confounding. Journal of the American Statistical Association, 118(543), 1953–1967.

Continue reading



Calibrating Regression Uncertainty via σ Scaling

May 01, 2025

This note is for Laves, M.-H., Ihler, S., Fast, J. F., Kahrs, L. A., & Ortmaier, T. (2020). Well-Calibrated Regression Uncertainty in Medical Imaging with Deep Learning. Proceedings of the Third Conference on Medical Imaging with Deep Learning, 393–412.

Continue reading



Direct Epistemic Uncertainty Prediction

April 25, 2025

This is the note for Lahlou, S., Jain, M., Nekoei, H., Butoi, V. I., Bertin, P., Rector-Brooks, J., Korablyov, M., & Bengio, Y. (2023). DEUP: Direct Epistemic Uncertainty Prediction (No. arXiv:2102.08501). arXiv. https://doi.org/10.48550/arXiv.2102.08501

Continue reading



Likelihood Annealing

April 23, 2025

This note is for Upadhyay, U., Kim, J. M., Schmidt, C., Schölkopf, B., & Akata, Z. (2023). Likelihood Annealing: Fast Calibrated Uncertainty for Regression (No. arXiv:2302.11012). arXiv. https://doi.org/10.48550/arXiv.2302.11012

Continue reading



LLM with Conformal Inference

April 23, 2025

This note is for Cherian, J., Gibbs, I., & Candès, E. (2024, November 6). Large language model validity via enhanced conformal prediction methods. The Thirty-eighth Annual Conference on Neural Information Processing Systems.

Continue reading



AI Models Collapse

April 21, 2025

This note is for Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson, R., & Gal, Y. (2024). AI models collapse when trained on recursively generated data. Nature, 631(8022), 755–759. https://doi.org/10.1038/s41586-024-07566-y

Continue reading



Cauchy Combination Test

April 09, 2025

This note is for Liu, Y., & Xie, J. (2020). Cauchy Combination Test: A Powerful Test With Analytic p-Value Calculation Under Arbitrary Dependency Structures. Journal of the American Statistical Association, 115(529), 393–402.

Continue reading



TabDDPM: Tabular Data with Diffusion Models

April 05, 2025

This note is for Kotelnikov, A., Baranchuk, D., Rubachev, I., & Babenko, A. (2023). TabDDPM: Modelling Tabular Data with Diffusion Models. Proceedings of the 40th International Conference on Machine Learning, 17564–17579.

Continue reading



PolyLoss

February 23, 2025

This note is for Leng, Z., Tan, M., Liu, C., Cubuk, E. D., Shi, X., Cheng, S., & Anguelov, D. (2022). PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions (No. arXiv:2204.12511). arXiv. https://doi.org/10.48550/arXiv.2204.12511

Continue reading



Entropy Regularization

February 23, 2025

This note is for Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., & Hinton, G. (2017). Regularizing Neural Networks by Penalizing Confident Output Distributions (No. arXiv:1701.06548). arXiv. https://doi.org/10.48550/arXiv.1701.06548

Continue reading



Label Smoothing

February 22, 2025

This note is for Müller, R., Kornblith, S., & Hinton, G. E. (2019). When does label smoothing help? Advances in Neural Information Processing Systems, 32.

Continue reading



Focal Loss

February 22, 2025

This post is for Mukhoti, J., Kulharia, V., Sanyal, A., Golodetz, S., Torr, P., & Dokania, P. (2020). Calibrating Deep Neural Networks using Focal Loss. Advances in Neural Information Processing Systems, 33, 15288–15299.

Continue reading



Finite Mixture Models

February 11, 2025

This note is for Wang, H., Ibrahim, S., & Mazumder, R. (2023). Nonparametric Finite Mixture Models with Possible Shape Constraints: A Cubic Newton Approach (No. arXiv:2107.08535). arXiv. https://doi.org/10.48550/arXiv.2107.08535

Continue reading



Simple Test-Time Scaling

February 07, 2025

This note is for Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., & Hashimoto, T. (2025). s1: Simple test-time scaling (No. arXiv:2501.19393). arXiv. https://doi.org/10.48550/arXiv.2501.19393

Continue reading



Approximations for KL between GMMs

February 07, 2025

This note is based on Hershey, J. R., & Olsen, P. A. (2007). Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models. 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP ’07, 4, IV-317-IV–320. https://doi.org/10.1109/ICASSP.2007.366913

Continue reading



Clustering of High-dim GMMs with EM

February 06, 2025

This post is for Cai, T. T., Ma, J., & Zhang, L. (2019). CHIME: Clustering of high-dimensional Gaussian mixtures with EM algorithm and its optimality. The Annals of Statistics, 47(3), 1234–1267. https://doi.org/10.1214/18-AOS1711

Continue reading



Statistical Review on Variational Inference

February 06, 2025

This note is for Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518), 859–877. https://doi.org/10.1080/01621459.2017.1285773

Continue reading



Over-specified EM for GMMs

February 04, 2025

This post is for Dwivedi, R., Ho, N., Khamaru, K., Wainwright, M. J., Jordan, M. I., & Yu, B. (2020). Singularity, Misspecification and the Convergence Rate of Em. The Annals of Statistics, 48(6), 3161–3182.

Continue reading



kmeans++ for Careful Seeding

January 28, 2025

This note is for Arthur, D., & Vassilvitskii, S. (2006). k-means++: The Advantages of Careful Seeding. Stanford, 11.

Continue reading



Sample Difficulty from Pre-trained Models

January 20, 2025

This note is for Cui, P., Zhang, D., Deng, Z., Dong, Y., & Zhu, J. (2023). Learning Sample Difficulty from Pre-trained Models for Reliable Prediction. Advances in Neural Information Processing Systems, 36, 25390–25408.

Continue reading



Concrete Distribution: Relaxation of Discrete Random Variables

January 17, 2025

This note is for Maddison, C. J., Mnih, A., & Teh, Y. W. (2017). The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables (No. arXiv:1611.00712). arXiv. http://arxiv.org/abs/1611.00712

Continue reading



PromptBench: Evaluation of Large Language Models

January 17, 2025

This note is for Zhu, K., Zhao, Q., Chen, H., Wang, J., & Xie, X. (2024). PromptBench: A Unified Library for Evaluation of Large Language Models.

Continue reading



FDR Control for Byzantine Machines

December 13, 2024

This is the note for Qian, C., Wang, M., Ren, H., & Zou, C. (2024). ByMI: Byzantine Machine Identification with False Discovery Rate Control. Proceedings of the 41st International Conference on Machine Learning, 41357–41382. https://proceedings.mlr.press/v235/qian24b.html

Continue reading



FDR Control under General Dependence via Symmetrization

December 13, 2024

This note is for Du, L., Guo, X., Sun, W., & Zou, C. (2023). False Discovery Rate Control Under General Dependence By Symmetrized Data Aggregation. Journal of the American Statistical Association, 118(541), 607–621. https://doi.org/10.1080/01621459.2021.1945459

Continue reading



Personalized Federated Learning with Robust and Sparse Regressions

December 10, 2024

This note is for Liu, W., Mao, X., Zhang, X., & Zhang, X. (2024). Robust Personalized Federated Learning with Sparse Penalization. Journal of the American Statistical Association, 0(0), 1–12. https://doi.org/10.1080/01621459.2024.2321652

Continue reading



Biomarker Variability in Joint Model

December 10, 2024

This note is for Wang, C., Shen, J., Charalambous, C., & Pan, J. (2024). Modeling biomarker variability in joint analysis of longitudinal and time-to-event data. Biostatistics, 25(2), 577–596. https://doi.org/10.1093/biostatistics/kxad009 and Wang, C., Shen, J., Charalambous, C., & Pan, J. (2024). Weighted biomarker variability in joint analysis of longitudinal and time-to-event data. The Annals of Applied Statistics, 18(3), 2576–2595. https://doi.org/10.1214/24-AOAS1896

Continue reading



Derandomised Knockoffs from E-values

December 09, 2024

This note is for Ren, Z., & Barber, R. F. (2024). Derandomised knockoffs: Leveraging e-values for false discovery rate control. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(1), 122–154. https://doi.org/10.1093/jrsssb/qkad085

Continue reading



FDR Control with E-values

December 05, 2024

This note is for Wang, R., & Ramdas, A. (2022). False Discovery Rate Control with E-values. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(3), 822–852. https://doi.org/10.1111/rssb.12489 and Aaditya’s talk at ISSI on October 25, 2023

Continue reading



Boosting Data Analytics with Synthetic Data

December 03, 2024

This post is for Shen, X., Liu, Y., & Shen, R. (2024). Boosting Data Analytics With Synthetic Volume Expansion (No. arXiv:2310.17848). arXiv. https://doi.org/10.48550/arXiv.2310.17848

Continue reading



Task-Agnostic Machine-Learning-Assisted Inference

November 22, 2024

This note is for Miao, J., & Lu, Q. (2024). Task-Agnostic Machine-Learning-Assisted Inference (No. arXiv:2405.20039). arXiv. https://doi.org/10.48550/arXiv.2405.20039

Continue reading



Review on Normalizing Flows

November 22, 2024

This note is for Kobyzev, I., Prince, S. J. D., & Brubaker, M. A. (2020). Normalizing Flows: An Introduction and Review of Current Methods (No. arXiv:1908.09257). arXiv. http://arxiv.org/abs/1908.09257

Continue reading



C-SIDE for Cell-type-specific Spatial DE

November 12, 2024

This note is for Cable, D. M., Murray, E., Shanmugam, V., Zhang, S., Zou, L. S., Diao, M., Chen, H., Macosko, E. Z., Irizarry, R. A., & Chen, F. (2022). Cell type-specific inference of differential expression in spatial transcriptomics. Nature Methods, 19(9), 1076–1087. https://doi.org/10.1038/s41592-022-01575-3

Continue reading



spaCRT: saddlepoint approximation-based conditional randomization test

November 04, 2024

This note is for Niu, Z., Choudhury, J. R., & Katsevich, E. (2024). Computationally efficient and statistically accurate conditional independence testing with spaCRT (No. arXiv:2407.08911; Version 1). arXiv. https://doi.org/10.48550/arXiv.2407.08911

Continue reading



Benchopt: Benchmarks for ML Optimizations

November 01, 2024

This is the note for Moreau, T., Massias, M., Gramfort, A., Ablin, P., Bannier, P.-A., Charlier, B., Dagréou, M., Tour, T. D. la, Durif, G., Dantas, C. F., Klopfenstein, Q., Larsson, J., Lai, E., Lefort, T., Malézieux, B., Moufad, B., Nguyen, B. T., Rakotomamonjy, A., Ramzi, Z., … Vaiter, S. (2022). Benchopt: Reproducible, efficient and collaborative optimization benchmarks (No. arXiv:2206.13424). arXiv. https://doi.org/10.48550/arXiv.2206.13424

Continue reading



XBART: Accelerated Bayesian Additive Regression Trees

October 04, 2024

This post is based on He, J., Yalov, S., & Hahn, P. R. (2019). XBART: Accelerated Bayesian Additive Regression Trees. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, 1130–1138. https://proceedings.mlr.press/v89/he19a.html and He, J., & Hahn, P. R. (2023). Stochastic Tree Ensembles for Regularized Nonlinear Regression. Journal of the American Statistical Association, 118(541), 551–570. https://doi.org/10.1080/01621459.2021.1942012

Continue reading



scDRS: single-cell disease relevance score

September 10, 2024

This note is for Zhang, M. J., Hou, K., Dey, K. K., Sakaue, S., Jagadeesh, K. A., Weinand, K., Taychameekiatchai, A., Rao, P., Pisco, A. O., Zou, J., Wang, B., Gandal, M., Raychaudhuri, S., Pasaniuc, B., & Price, A. L. (2022). Polygenic enrichment distinguishes disease associations of individual cells in single-cell RNA-seq data. Nature Genetics, 54(10), 1572–1580. https://doi.org/10.1038/s41588-022-01167-z

Continue reading



Guarantees of Lloyd’s Algorithm

September 10, 2024

This note is for Lu, Y., & Zhou, H. H. (2016). Statistical and Computational Guarantees of Lloyd’s Algorithm and its Variants (No. arXiv:1612.02099). arXiv. http://arxiv.org/abs/1612.02099

Continue reading



Data Thinning for Convolution-Closed Distributions

August 29, 2024

This note is for Neufeld, A., Dharamshi, A., Gao, L. L., & Witten, D. (2024). Data Thinning for Convolution-Closed Distributions. Journal of Machine Learning Research, 25(57), 1–35.

Continue reading



Data Fission

August 05, 2024

This note is for the discussion paper Leiner, J., Duan, B., Wasserman, L., & Ramdas, A. (2023). Data fission: Splitting a single data point (arXiv:2112.11079). arXiv. http://arxiv.org/abs/2112.11079 in the JASA invited session at JSM 2024

Continue reading



Watermarks in Large Language Models

August 05, 2024

This is the note for the talk Statistical Inference in Large Language Models: A Statistical Framework of Watermarks given by Weijie Su at JSM 2024

Continue reading



See all posts →