Posted on Monday, January 1, 2024

Welcome to the New England Journal of Statistics in Data Science (NEJSDS) Workshop on Statistics in Data Science. This one-day event features a keynote session and two invited sessions, and concludes with an editorial board meeting.

NEJSDS is the official journal of the New England Statistical Society (NESS). The journal aims to serve as an interface between statistics and other disciplines in data science, to encourage researchers to exchange innovative ideas, and to promote data science methods to the general scientific community.

Registration

Registration is free. Please register here, and indicate in the registration form whether you will attend in person and whether you need a lunch ticket.

Date and location

Date: Friday, January 26, 2024
In-person location: Room 101, Lawrence D. McHugh Hall, University of Connecticut, Storrs, CT 06269 (Map)
Online participation: Zoom Link

Keynote and Invited Speakers

Haim Bar, University of Connecticut
Rong Chen, Chair, Rutgers University
Yang Feng, New York University
Ofer Harel, University of Connecticut
Jeffrey Miller, Harvard University
Gavino Puggioni, University of Rhode Island
Weijie Su, University of Pennsylvania
Linjun Zhang, Rutgers University

Schedule

Welcome Remarks

08:50-09:00 AM

Kolaczyk, President of NESS, Professor of Statistics at McGill University

Keynote session

09:00-09:45 AM, Keynote 1

Speaker: Rong Chen, Chair, Department of Statistics, Rutgers University
Title: Kronecker Product Approximation for Matrix Approximation, Denoising and Completion
Abstract: We consider the problem of matrix approximation, denoising and completion induced by the Kronecker product decomposition. Specifically, we propose to approximate a given matrix by the sum of a few Kronecker products of smaller matrices, which we refer to as the Kronecker product approximation (KoPA). Because the Kronecker product is an extension of the outer product from vectors to matrices, KoPA extends the low rank matrix approximation and includes the latter as a special case. Compared with the latter, KoPA also offers greater flexibility, since it allows the user to choose the configuration, that is, the dimensions of the two matrices forming the Kronecker product. As the configuration to be used is usually unknown, an extended information criterion is used to select the configuration. The model is extended to allow for multiple terms with different configurations (hybrid-KoPA) for more efficient approximation and denoising. It is also used for matrix completion tasks, with superior theoretical and numerical properties.
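
For readers new to the idea, here is a minimal sketch of the single-term case with a known configuration. It is not the speaker's estimator: it uses the classical rearrangement-plus-SVD construction (often attributed to Van Loan and Pitsianis) to find the best Frobenius-norm fit of one Kronecker product, and it does not cover configuration selection, hybrid-KoPA, or completion. The function names `rearrange` and `kron_approx` and the simulated data are assumptions made for illustration.

```python
import numpy as np


def rearrange(X, shape_a, shape_b):
    """Rearrange X so that a Kronecker product A (x) B becomes the
    rank-one matrix vec(A) vec(B)^T (row-major vectorization)."""
    m1, n1 = shape_a
    m2, n2 = shape_b
    if X.shape != (m1 * m2, n1 * n2):
        raise ValueError("X does not match the requested configuration")
    # X[i*m2 + k, j*n2 + l] is moved to row i*n1 + j, column k*n2 + l.
    blocks = X.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3)
    return blocks.reshape(m1 * n1, m2 * n2)


def kron_approx(X, shape_a, shape_b):
    """Best single-term Kronecker approximation of X in Frobenius norm,
    for a fixed configuration (shape_a, shape_b)."""
    R = rearrange(X, shape_a, shape_b)
    # The leading singular triple of R gives the best rank-one fit.
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    scale = np.sqrt(s[0])
    A = scale * U[:, 0].reshape(shape_a)
    B = scale * Vt[0].reshape(shape_b)
    return A, B


# Hypothetical demo: denoise a noisy Kronecker product.
rng = np.random.default_rng(0)
A_true = rng.standard_normal((4, 3))
B_true = rng.standard_normal((5, 6))
signal = np.kron(A_true, B_true)
noisy = signal + 0.1 * rng.standard_normal(signal.shape)

A_hat, B_hat = kron_approx(noisy, (4, 3), (5, 6))
rel_err = np.linalg.norm(np.kron(A_hat, B_hat) - signal) / np.linalg.norm(signal)
print(f"relative error after denoising: {rel_err:.3f}")
```

When the configuration is (m, 1) and (1, n), the Kronecker product reduces to an outer product, so this sketch recovers the usual rank-one approximation, which is the sense in which low rank approximation is a special case.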

09:45-10:30 AM, Keynote 2

Speaker: Ofer Harel, Dean, College of Liberal Arts and Sciences, University of Connecticut
Title: Non-inferiority clinical trials: treating margin as missing information
Abstract: The goal of non-inferiority (NI) clinical trials is to demonstrate that a new treatment is not worse than a standard of care by more than a certain amount, called the margin. The choice of non-inferiority margin is not straightforward, as it depends on historical data and clinical experts’ opinion. Knowing the “true”, objective clinical margin would be helpful for the design and analysis of non-inferiority trials, but it is not possible in practice. We propose to treat the non-inferiority margin as missing information. In order to recover an objective margin, we believe it is essential to conduct a survey among a group of representative clinical experts. We introduce a novel framework, where data obtained from a survey are combined with NI trial data, so that both an estimated clinically acceptable margin and its uncertainty are accounted for when claiming non-inferiority. Through simulations, we compare several methods for implementing this framework. We believe the proposed framework would lead to better informed decisions regarding new potentially non-inferior treatments and could help resolve current practical issues related to the choice of the margin. This is joint work with Yulia Sidi and Benjamin Stockton.

Session 1

10:45 - 11:15 AM, Invited Talk 1

Speaker: Haim Bar, University of Connecticut
Title: On Graphical Models and Convex Geometry
Abstract: We introduce a mixture-model of beta distributions to identify significant correlations among P predictors when P is large. The method relies on theorems in convex geometry, which we use to show how to control the error rate of edge detection in graphical models. Our ‘betaMix’ method does not require any assumptions about the network structure, nor does it assume that the network is sparse. The results hold for a wide class of data generating distributions that include light-tailed and heavy-tailed spherically symmetric distributions.

11:15 - 11:45 AM, Invited Talk 2

Speaker: Jeffrey Miller, Harvard University
Title: Truth-agnostic diagnostics for calibration under misspecification
Abstract: Under model misspecification, it is known that usual Bayes posteriors often do not properly quantify uncertainty about the true or pseudo-true parameters. However, this lack of correct calibration is always defined with respect to a somewhat arbitrary notion of what is “true”, making it conceptually unappealing and difficult to check empirically. We introduce a notion of internally coherent uncertainty quantification that is completely agnostic to the truth. Specifically, we consider the probability that two confidence sets constructed from independent data sets have nonempty overlap, and we establish a lower bound on this overlap probability that holds for any valid confidence sets. We show that, under misspecification, credible sets from the usual Bayes posterior can strongly violate this bound, indicating that it is not internally coherent. Meanwhile, the bagged posterior (BayesBag) typically satisfies the bound and yields more reproducible inferences under misspecification.

11:45 AM - 12:15 PM, Invited Talk 3

Speaker: Gavino Puggioni, University of Rhode Island
Title:
Abstract:

Lunch

12:30-01:30 PM

South Dining Hall (South Marketplace), Map.

Session 2

01:30 - 02:00 PM, Invited Talk 4

Speaker: Yang Feng, New York University
Title: Robust Unsupervised Multi-task and Transfer Learning on Gaussian Mixture Models
Abstract: Unsupervised learning has been widely used in many real-world applications. One of the simplest and most important unsupervised learning models is the Gaussian mixture model (GMM). In this work, we study the multi-task learning problem on GMMs, which aims to leverage potentially similar GMM parameter structures among tasks to obtain improved learning performance compared to single-task learning. We propose a multi-task GMM learning procedure based on the EM algorithm that not only can effectively utilize unknown similarities between related tasks but is also robust against a fraction of outlier tasks from arbitrary distributions. The proposed procedure is shown to achieve a minimax optimal rate of convergence for both parameter estimation error and the excess mis-clustering error, in a wide range of regimes. Moreover, we generalize our approach to tackle the problem of transfer learning for GMMs, where similar theoretical results are derived. Finally, we demonstrate the effectiveness of our methods through simulations and real data examples. To the best of our knowledge, this is the first work studying multi-task and transfer learning on GMMs with theoretical guarantees.

02:00 - 02:30 PM, Invited Talk 5

Speaker: Weijie Su, University of Pennsylvania
Title: Navigating the Societal Landscape of Generative AI: Opportunities and Challenges
Abstract: Generative AI, such as large language models, has rapidly emerged as a pivotal innovation in many areas of data science and machine learning. These technologies raise important questions when their use affects human decisions. In this talk, we explore three key concerns from a societal viewpoint: First, how to create fair AI systems that adequately represent and serve minority groups. Second, the challenge of reliably telling apart the outputs from AI and humans to maintain trust in digital communications. Third, the complex issue of using data that may be copyrighted to train these AI models. Throughout this talk, we will tackle these pressing issues of generative AI, simultaneously shedding light on the substantial opportunities these innovations present.

02:30 - 03:00 PM, Invited Talk 6

Speaker: Linjun Zhang, Rutgers University
Title: Fair conformal prediction and risk control
Abstract: Multi-calibration is a powerful and evolving concept originating in the field of algorithmic fairness. For a predictor $f$ that estimates the outcome $y$ given covariates $x$, and for a function class $C$, multi-calibration requires that the predictor $f(x)$ and outcome $y$ are indistinguishable under the class of auditors in $C$. Fairness is captured by incorporating demographic subgroups into the class of functions $C$. Recent work has shown that, by enriching the class $C$ to incorporate appropriate propensity re-weighting functions, multi-calibration also yields target-independent learning, wherein a model trained on a source domain performs well on unseen, future target domains (approximately) captured by the re-weightings. The multi-calibration notion is extended, and the power of an enriched class of mappings is explored. HappyMap, a generalization of multi-calibration, is proposed, which yields a wide range of new applications, including a new fairness notion for uncertainty quantification (conformal prediction), a novel technique for conformal prediction under covariate shift, and a different approach for fair risk control, while also yielding a unified understanding of several existing seemingly disparate algorithmic fairness notions and target-independent learning approaches. A single HappyMap meta-algorithm is given that captures all these results, together with a sufficiency condition for its success.

NEJSDS editorial meeting

3:30 - 6:30 PM

Zoom Link (different from the workshop online participation link)
