October 08, 2025
This note is for Deng, Z., Zhang, J., Zhang, L., Ye, T., Coley, Y., Su, W. J., & Zou, J. (2022). FIFA: Making Fairness More Generalizable in Classifiers Trained on Imbalanced Data (No. arXiv:2206.02792). arXiv.
Continue reading
October 06, 2025
This note is for Zhang, L., Roth, A., & Zhang, L. (2024). Fair Risk Control: A Generalized Framework for Calibrating Multi-group Fairness Risks. Proceedings of the 41st International Conference on Machine Learning, 59783–59805.
Continue reading
October 05, 2025
This note is for Li, X., Ruan, F., Wang, H., Long, Q., & Su, W. J. (2025). Robust detection of watermarks for large language models under human edits. Journal of the Royal Statistical Society Series B.
Continue reading
October 04, 2025
This note is for Viladomat, J., Mazumder, R., McInturff, A., McCauley, D. J., & Hastie, T. (2014). Assessing the significance of global and local correlations under spatial autocorrelation: A nonparametric approach. Biometrics, 70(2), 409–418.
Continue reading
October 03, 2025
This note is for Wang, Y., Zang, C., Li, Z., Guo, C. C., Lai, D., & Wei, P. (2025). A comparative study of statistical methods for identifying differentially expressed genes in spatial transcriptomics (p. 2025.02.17.638726). bioRxiv.
Seurat, by far the most popular tool for analyzing ST data, uses the Wilcoxon rank-sum test by default for differential expression analysis
the paper proposes a Generalized Score Test (GST) in the Generalized Estimating Equations (GEE) framework as a robust solution for differential gene expression analysis in ST.
Introduction
a common task in ST data analysis involves identifying DE genes across pathological grades
this study considered two potential approaches: generalized linear mixed model (GLMM) and generalized estimating equations (GEE)
2.2 Generalized Linear Mixed Model (GLMM)
Let $Y_{ij}$ represent the gene expression count for gene $j$ at spatial location $i$.
$X_i$: binary dummy variable for pathology grade at location $i$
spatial coordinates $s_i = (s_{i1}, s_{i2})$ for spatial location $i$
the model is specified as
\[\log \mu_{ij} = X_i^T\beta + \varepsilon_{ij}\]
where
$\mu_{ij}$ is the expected count for gene $j$ at location $i$
$\beta = [\beta_{j0}, \beta_{j1}]^T$ is the vector of fixed effect coefficients, where $\beta_{j0}$ is the intercept and $\beta_{j1}$ is the effect of Grade B compared to Grade A
the random effects for gene $j$, $\varepsilon_j = (\varepsilon_{1j}, \ldots, \varepsilon_{nj})^T$, are assumed to follow a multivariate normal distribution, $\varepsilon_j \sim N(0, V(\sigma_j^2, \kappa_j, \tau))$, with $\varepsilon_{ij}$ the random effect for gene $j$ at the $i$-th spatial location
for a given gene $j$, the spatial covariance matrix $V(\sigma_j^2, \kappa_j, \tau)$ is defined based on the distances between pairs of spatial locations
the $(i, i')$-th element of $V(\sigma_j^2, \kappa_j, \tau)$ is given by
\[V_{ii'}(\sigma_j^2, \kappa_j, \tau) = \sigma_j^2 R(\tau_{ii'}, \kappa_j) = \sigma_j^2 \exp(-\tau_{ii'}/\kappa_j)\]
$\tau_{ii'} = \Vert s_i - s_{i'}\Vert$ denotes the Euclidean distance between spatial locations $i$ and $i'$
$\kappa_j > 0$ is a parameter that determines the rate of decay in correlation with distance, with larger values of $\kappa_j$ indicating stronger correlations and smaller semi-variances
the exponential spatial structure here is a special case of the Matérn correlation structure $R(\tau)$; a small numerical sketch of this covariance follows
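As a concrete illustration, here is a minimal numpy sketch of the exponential covariance above; the coordinates and the values of $\sigma_j^2$ and $\kappa_j$ are made up for illustration, not taken from the paper.

```python
# Minimal sketch of the exponential spatial covariance V(sigma_j^2, kappa_j, tau);
# coordinates and parameter values are illustrative, not from the paper.
import numpy as np
from scipy.spatial.distance import cdist

def exponential_cov(coords, sigma2, kappa):
    """V_{ii'} = sigma2 * exp(-||s_i - s_{i'}|| / kappa)."""
    tau = cdist(coords, coords)      # pairwise Euclidean distances tau_{ii'}
    return sigma2 * np.exp(-tau / kappa)

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(50, 2))  # 50 toy spatial locations s_i = (s_i1, s_i2)
V = exponential_cov(coords, sigma2=1.0, kappa=2.0)
# larger kappa -> slower decay with distance -> stronger spatial correlation
```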
to test for DE genes across the two pathology grades, test
\[H_0: \beta_{j1} = 0\qquad H_a: \beta_{j1} \neq 0\]
2.3 Generalized Estimating Equations (GEE)
instead of explicitly modeling the spatial correlation structure using random effects, GEE uses a “working” correlation matrix to account for the spatial dependence between observations
the paper adopts the GEE model with an independence working correlation structure, dividing the whole ST tissue into $m$ spatial clusters using $K$-means clustering.
the mean of $Y_{ij}$, denoted by $\mu_{ij}$, is linked to the covariates through a log link function, and the model is specified as
\[\log\mu_{ij} = X_i^T\beta\]
where $\beta = [\beta_{j0}, \beta_{j1}]^T$ is the vector of fixed effect coefficients.
The parameters $\beta$ are estimated by solving the GEE:
\[\sum_{i=1}^m D_i^TV_i^{-1}(Y_i - \mu_i) = 0\]
where
$D_i$ is the derivative of the mean response with respect to $\beta$ in cluster $i$: $D_i = \frac{\partial \mu_i}{\partial \beta}$
$V_i$ is the variance-covariance matrix of the responses in cluster $i$, a function of the working correlation matrix $R_i$ ($V_i = C_i^{1/2}R_iC_i^{1/2}$)
$C_i$ is the diagonal matrix containing the variances of the individual observations within cluster $i$
the robust variance estimate for the estimated coefficients is calculated by the sandwich estimator
\[\widehat{Var}(\hat\beta) = \left(\sum_{i=1}^m D_i^TV_i^{-1}D_i \right)^{-1}\left(\sum_{i=1}^m D_i^TV_i^{-1}(Y_i - \mu_i)(Y_i - \mu_i)^TV_i^{-1}D_i\right)\left(\sum_{i=1}^m D_i^TV_i^{-1}D_i\right)^{-1}\]
an advantage of the GEE framework is that it produces consistent and asymptotically normal estimates of the parameters even if the working correlation structure is incorrectly specified
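A hedged sketch of this fitting step, assuming statsmodels' GEE with a Poisson family and an independence working correlation, and scikit-learn's KMeans for the spatial clusters, matches the setup described above; the function name fit_gee_one_gene and the inputs counts, grade, and coords are placeholders, not the paper's code.

```python
# Hedged sketch of the GEE fit described above, assuming statsmodels' GEE and
# scikit-learn's KMeans; fit_gee_one_gene and its inputs are placeholders.
import statsmodels.api as sm
from sklearn.cluster import KMeans

def fit_gee_one_gene(counts, grade, coords, m_clusters=20):
    """counts: (n,) expression counts for one gene; grade: (n,) 0/1 pathology
    grade; coords: (n, 2) spatial coordinates."""
    # divide the tissue into m spatial clusters via K-means on the coordinates
    clusters = KMeans(n_clusters=m_clusters, n_init=10,
                      random_state=0).fit_predict(coords)
    X = sm.add_constant(grade)                       # columns: beta_j0, beta_j1
    model = sm.GEE(counts, X, groups=clusters,
                   family=sm.families.Poisson(),             # log link
                   cov_struct=sm.cov_struct.Independence())  # working independence
    return model.fit()   # reports robust (sandwich) standard errors by default
```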
2.3.1 Robust Wald Test
the robust Wald test statistic $W$ is computed as
\[W = \frac{\hat\beta_{j1}}{\widehat{se}(\hat\beta_{j1})}\]
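Continuing the sketch above, $W$ and its two-sided normal p-value can be read off the fitted result; counts, grade, and coords are the same placeholder inputs as before.

```python
# Robust Wald test for H0: beta_j1 = 0, continuing the sketch above.
from scipy import stats

res = fit_gee_one_gene(counts, grade, coords)  # placeholder data from the sketch above
W = res.params[1] / res.bse[1]                 # hat{beta}_j1 over its robust SE
p_wald = 2 * stats.norm.sf(abs(W))             # compare W to a standard normal
```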
2.3.2 Generalized Score Test (GST)
The asymptotic equivalence between the GST statistic and the robust Wald statistic holds only when the number of spatial clusters is large.
the score function $U(\hat\beta_0)$ is the derivative of the quasi-likelihood with respect to the parameter $\beta$, evaluated at the estimate $\hat\beta_0$ obtained under the null hypothesis:
\[U(\hat\beta_0) = \sum_{i=1}^m D_i^TV_i^{-1}(Y_i - \mu_i)\]
The GST statistic $S$ is computed from the score component for $\beta_{j1}$, evaluated at the null estimate $\hat\beta_0$:
\[S = \frac{U_{j1}(\hat\beta_0)}{\widehat{se}(U_{j1}(\hat\beta_0))}\]
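A simplified, illustrative version of the GST for this particular null (Poisson log link, intercept plus one binary covariate, independence working correlation): the nuisance intercept is profiled out of the score, and the score variance is estimated robustly from cluster-level sums. This is a sketch under those assumptions, not the paper's implementation.

```python
# Simplified illustrative GST for H0: beta_j1 = 0 under a Poisson log link with
# an independence working correlation; not the paper's implementation.
import numpy as np
from scipy import stats

def gst_one_gene(counts, grade, clusters):
    mu0 = counts.mean()                  # null fit: intercept only, mu = exp(beta_j0)
    resid = counts - mu0
    # efficient score for beta_j1: centering grade projects out the estimated
    # intercept (the nuisance parameter) when mu is constant under the null
    score_i = (grade - grade.mean()) * resid
    U = score_i.sum()
    # robust variance of the score from cluster-level sums (clusters are the
    # independent units, as in the sandwich estimator above)
    var_U = sum(score_i[clusters == c].sum() ** 2 for c in np.unique(clusters))
    S = U / np.sqrt(var_U)
    return S, 2 * stats.norm.sf(abs(S))  # statistic and two-sided p-value
```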
3. Results
3.1 Simulation Studies
fit the GLMM to a breast cancer ST dataset from 10x Genomics to estimate two key spatial parameters:
$\sigma^2$: spatial variance that captures variability in gene expression across different spatial locations
$\kappa$: spatial correlation that controls the rate of decay in correlation with distance between spatial locations
use the estimated parameters to generate simulated data
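An illustrative version of this generation step, reusing exponential_cov and coords from the first sketch: draw spatially correlated random effects from the exponential covariance, then sample Poisson counts. All parameter values below are placeholders, not the estimates from the 10x Genomics fit.

```python
# Illustrative simulation step, reusing exponential_cov and coords from the
# first sketch; parameter values are placeholders, not the paper's estimates.
import numpy as np

rng = np.random.default_rng(1)
n = coords.shape[0]
V = exponential_cov(coords, sigma2=0.5, kappa=2.0)
eps = rng.multivariate_normal(np.zeros(n), V)    # spatially correlated random effects
grade = (coords[:, 0] > coords[:, 0].mean()).astype(float)  # toy two-grade split
mu = np.exp(1.0 + 0.5 * grade + eps)             # beta_j0 = 1.0, beta_j1 = 0.5
counts = rng.poisson(mu)                         # simulated counts for one gene
```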
3.2 Real Data
3.2.1 Breast Cancer ST dataset
3798 spots and 24923 genes, retaining the 3000 most variable genes
3.2.2 Prostate Cancer ST dataset
4371 spots and 16907 genes, retaining the 3000 most variable genes
4. Discussion
Continue reading
October 03, 2025
This note is for Xie, Y., Li, X., Mallick, T., Su, W., & Zhang, R. (2025). Debiasing Watermarks for Large Language Models via Maximal Coupling. Journal of the American Statistical Association, 0(0), 1–11.
Continue reading
October 02, 2025
This note is for the paper Zrnic, T., & Jordan, M. I. (2023). Post-selection inference via algorithmic stability. The Annals of Statistics, 51(4), 1666–1691.
Continue reading
September 04, 2025 (Update: September 09, 2025)
This note is for Jain, A., Montanari, A., & Sasoglu, E. (2024, November 6). Scaling laws for learning with real and surrogate data. The Thirty-eighth Annual Conference on Neural Information Processing Systems.
Continue reading
July 22, 2025
This note is for Tian, X., & Shen, X. (2025). Enhancing Accuracy in Generative Models via Knowledge Transfer (No. arXiv:2405.16837). arXiv.
Continue reading
June 12, 2025
This note is for Tang, D., Kong, D., & Wang, L. (2024). The synthetic instrument: From sparse association to sparse causation (No. arXiv:2304.01098). arXiv.
Continue reading
June 02, 2025
This note is for Miao, W., Hu, W., Ogburn, E. L., & Zhou, X.-H. (2023). Identifying Effects of Multiple Treatments in the Presence of Unmeasured Confounding. Journal of the American Statistical Association, 118(543), 1953–1967.
Continue reading
May 01, 2025
This note is for Laves, M.-H., Ihler, S., Fast, J. F., Kahrs, L. A., & Ortmaier, T. (2020). Well-Calibrated Regression Uncertainty in Medical Imaging with Deep Learning. Proceedings of the Third Conference on Medical Imaging with Deep Learning, 393–412.
Continue reading
April 25, 2025
This is the note for Lahlou, S., Jain, M., Nekoei, H., Butoi, V. I., Bertin, P., Rector-Brooks, J., Korablyov, M., & Bengio, Y. (2023). DEUP: Direct Epistemic Uncertainty Prediction (No. arXiv:2102.08501). arXiv. https://doi.org/10.48550/arXiv.2102.08501
Continue reading
April 23, 2025
This note is for Upadhyay, U., Kim, J. M., Schmidt, C., Schölkopf, B., & Akata, Z. (2023). Likelihood Annealing: Fast Calibrated Uncertainty for Regression (No. arXiv:2302.11012). arXiv. https://doi.org/10.48550/arXiv.2302.11012
Continue reading
April 23, 2025
This note is for Cherian, J., Gibbs, I., & Candes, E. (2024, November 6). Large language model validity via enhanced conformal prediction methods. The Thirty-eighth Annual Conference on Neural Information Processing Systems.
Continue reading
April 21, 2025
This note is for Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson, R., & Gal, Y. (2024). AI models collapse when trained on recursively generated data. Nature, 631(8022), 755–759. https://doi.org/10.1038/s41586-024-07566-y
Continue reading
April 09, 2025
This note is for Liu, Y., & Xie, J. (2020). Cauchy Combination Test: A Powerful Test With Analytic p-Value Calculation Under Arbitrary Dependency Structures. Journal of the American Statistical Association, 115(529), 393–402.
Continue reading
April 05, 2025
This note is for Kotelnikov, A., Baranchuk, D., Rubachev, I., & Babenko, A. (2023). TabDDPM: Modelling Tabular Data with Diffusion Models. Proceedings of the 40th International Conference on Machine Learning, 17564–17579.
Continue reading
February 23, 2025
This note is for Leng, Z., Tan, M., Liu, C., Cubuk, E. D., Shi, X., Cheng, S., & Anguelov, D. (2022). PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions (No. arXiv:2204.12511). arXiv. https://doi.org/10.48550/arXiv.2204.12511
Continue reading
February 23, 2025
This note is for Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., & Hinton, G. (2017). Regularizing Neural Networks by Penalizing Confident Output Distributions (No. arXiv:1701.06548). arXiv. https://doi.org/10.48550/arXiv.1701.06548
Continue reading
February 22, 2025
This note is for Müller, R., Kornblith, S., & Hinton, G. E. (2019). When does label smoothing help? Advances in Neural Information Processing Systems, 32.
Continue reading
February 22, 2025
This post is for Mukhoti, J., Kulharia, V., Sanyal, A., Golodetz, S., Torr, P., & Dokania, P. (2020). Calibrating Deep Neural Networks using Focal Loss. Advances in Neural Information Processing Systems, 33, 15288–15299.
Continue reading
February 11, 2025
This note is for Wang, H., Ibrahim, S., & Mazumder, R. (2023). Nonparametric Finite Mixture Models with Possible Shape Constraints: A Cubic Newton Approach (No. arXiv:2107.08535). arXiv. https://doi.org/10.48550/arXiv.2107.08535
Continue reading
February 07, 2025
This note is for Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., & Hashimoto, T. (2025). s1: Simple test-time scaling (No. arXiv:2501.19393). arXiv. https://doi.org/10.48550/arXiv.2501.19393
Continue reading
February 07, 2025
This note is based on Hershey, J. R., & Olsen, P. A. (2007). Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models. 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP ’07, 4, IV-317-IV–320. https://doi.org/10.1109/ICASSP.2007.366913
Continue reading
February 06, 2025
This post is for Cai, T. T., Ma, J., & Zhang, L. (2019). CHIME: Clustering of high-dimensional Gaussian mixtures with EM algorithm and its optimality. The Annals of Statistics, 47(3), 1234–1267. https://doi.org/10.1214/18-AOS1711
Continue reading
February 06, 2025
This note is for Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518), 859–877. https://doi.org/10.1080/01621459.2017.1285773
Continue reading
February 04, 2025
This post is for Dwivedi, R., Ho, N., Khamaru, K., Wainwright, M. J., Jordan, M. I., & Yu, B. (2020). Singularity, Misspecification and the Convergence Rate of Em. The Annals of Statistics, 48(6), 3161–3182.
Continue reading
January 28, 2025
This note is for Arthur, D., & Vassilvitskii, S. (2006). k-means++: The Advantages of Careful Seeding. Stanford, 11.
Continue reading
January 20, 2025
This note is for Cui, P., Zhang, D., Deng, Z., Dong, Y., & Zhu, J. (2023). Learning Sample Difficulty from Pre-trained Models for Reliable Prediction. Advances in Neural Information Processing Systems, 36, 25390–25408.
Continue reading
January 17, 2025
This note is for Maddison, C. J., Mnih, A., & Teh, Y. W. (2017). The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables (No. arXiv:1611.00712). arXiv. http://arxiv.org/abs/1611.00712
Continue reading
January 17, 2025
This note is for Zhu, K., Zhao, Q., Chen, H., Wang, J., & Xie, X. (2024). PromptBench: A Unified Library for Evaluation of Large Language Models.
Continue reading
December 13, 2024
This is the note for Qian, C., Wang, M., Ren, H., & Zou, C. (2024). ByMI: Byzantine Machine Identification with False Discovery Rate Control. Proceedings of the 41st International Conference on Machine Learning, 41357–41382. https://proceedings.mlr.press/v235/qian24b.html
Continue reading
December 13, 2024
This note is for Du, L., Guo, X., Sun, W., & Zou, C. (2023). False Discovery Rate Control Under General Dependence By Symmetrized Data Aggregation. Journal of the American Statistical Association, 118(541), 607–621. https://doi.org/10.1080/01621459.2021.1945459
Continue reading
December 10, 2024 (Update: December 12, 2024)
This note is for Liu, W., Mao, X., Zhang, X., & Zhang, X. (2024). Robust Personalized Federated Learning with Sparse Penalization. Journal of the American Statistical Association, 0(0), 1–12. https://doi.org/10.1080/01621459.2024.2321652
Continue reading
December 10, 2024
This note is for Wang, C., Shen, J., Charalambous, C., & Pan, J. (2024). Modeling biomarker variability in joint analysis of longitudinal and time-to-event data. Biostatistics, 25(2), 577–596. https://doi.org/10.1093/biostatistics/kxad009 and Wang, C., Shen, J., Charalambous, C., & Pan, J. (2024). Weighted biomarker variability in joint analysis of longitudinal and time-to-event data. The Annals of Applied Statistics, 18(3), 2576–2595. https://doi.org/10.1214/24-AOAS1896
Continue reading
December 09, 2024
This note is for Ren, Z., & Barber, R. F. (2024). Derandomised knockoffs: Leveraging e-values for false discovery rate control. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(1), 122–154. https://doi.org/10.1093/jrsssb/qkad085
Continue reading
December 05, 2024
This note is for Wang, R., & Ramdas, A. (2022). False Discovery Rate Control with E-values. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(3), 822–852. https://doi.org/10.1111/rssb.12489 and Aaditya’s talk at ISSI on October 25, 2023
Continue reading
December 03, 2024 (Update: April 21, 2025)
This post is for Shen, X., Liu, Y., & Shen, R. (2024). Boosting Data Analytics With Synthetic Volume Expansion (No. arXiv:2310.17848). arXiv. https://doi.org/10.48550/arXiv.2310.17848
Continue reading
November 22, 2024
This note is for Miao, J., & Lu, Q. (2024). Task-Agnostic Machine-Learning-Assisted Inference (No. arXiv:2405.20039). arXiv. https://doi.org/10.48550/arXiv.2405.20039
Continue reading
November 22, 2024 (Update: November 23, 2024)
This note is for Kobyzev, I., Prince, S. J. D., & Brubaker, M. A. (2020). Normalizing Flows: An Introduction and Review of Current Methods (No. arXiv:1908.09257). arXiv. http://arxiv.org/abs/1908.09257
Continue reading
November 12, 2024 (Update: November 23, 2024)
This note is for Cable, D. M., Murray, E., Shanmugam, V., Zhang, S., Zou, L. S., Diao, M., Chen, H., Macosko, E. Z., Irizarry, R. A., & Chen, F. (2022). Cell type-specific inference of differential expression in spatial transcriptomics. Nature Methods, 19(9), 1076–1087. https://doi.org/10.1038/s41592-022-01575-3
Continue reading
November 04, 2024
This note is for Niu, Z., Choudhury, J. R., & Katsevich, E. (2024). Computationally efficient and statistically accurate conditional independence testing with spaCRT (No. arXiv:2407.08911; Version 1). arXiv. https://doi.org/10.48550/arXiv.2407.08911
Continue reading
November 01, 2024 (Update: November 04, 2024)
This is the note for Moreau, T., Massias, M., Gramfort, A., Ablin, P., Bannier, P.-A., Charlier, B., Dagréou, M., Dupré la Tour, T., Durif, G., Dantas, C. F., Klopfenstein, Q., Larsson, J., Lai, E., Lefort, T., Malézieux, B., Moufad, B., Nguyen, B. T., Rakotomamonjy, A., Ramzi, Z., … Vaiter, S. (2022). Benchopt: Reproducible, efficient and collaborative optimization benchmarks (No. arXiv:2206.13424). arXiv. https://doi.org/10.48550/arXiv.2206.13424
Continue reading
October 04, 2024
This post is based on He, J., Yalov, S., & Hahn, P. R. (2019). XBART: Accelerated Bayesian Additive Regression Trees. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, 1130–1138. https://proceedings.mlr.press/v89/he19a.html and He, J., & Hahn, P. R. (2023). Stochastic Tree Ensembles for Regularized Nonlinear Regression. Journal of the American Statistical Association, 118(541), 551–570. https://doi.org/10.1080/01621459.2021.1942012
Continue reading
September 10, 2024 (Update: September 14, 2024)
This note is for Zhang, M. J., Hou, K., Dey, K. K., Sakaue, S., Jagadeesh, K. A., Weinand, K., Taychameekiatchai, A., Rao, P., Pisco, A. O., Zou, J., Wang, B., Gandal, M., Raychaudhuri, S., Pasaniuc, B., & Price, A. L. (2022). Polygenic enrichment distinguishes disease associations of individual cells in single-cell RNA-seq data. Nature Genetics, 54(10), 1572–1580. https://doi.org/10.1038/s41588-022-01167-z
Continue reading
September 10, 2024 (Update: September 14, 2024)
This note is for Lu, Y., & Zhou, H. H. (2016). Statistical and Computational Guarantees of Lloyd’s Algorithm and its Variants (No. arXiv:1612.02099). arXiv. http://arxiv.org/abs/1612.02099
Continue reading
August 29, 2024
This note is for Neufeld, A., Dharamshi, A., Gao, L. L., & Witten, D. (2024). Data Thinning for Convolution-Closed Distributions. Journal of Machine Learning Research, 25(57), 1–35.
Continue reading
August 05, 2024 (Update: July 06, 2025)
This note is for the discussion paper Leiner, J., Duan, B., Wasserman, L., & Ramdas, A. (2023). Data fission: Splitting a single data point (arXiv:2112.11079). arXiv. http://arxiv.org/abs/2112.11079 in the JASA invited session at JSM 2024
Continue reading
August 05, 2024 (Update: August 16, 2024)
This is the note for the talk Statistical Inference in Large Language Models: A Statistical Framework of Watermarks given by Weijie Su at JSM 2024
Continue reading
See all posts →