Ishaq Aden-Ali | Publications

2026

Preprint

Optimal Online Discrepancy Minimization in Linear Time

Ishaq Aden-Ali

Preprint, 2026

We provide an online algorithm with the following guarantee: for any fixed sequence of vectors \(v_1, \dots, v_T \in \mathbb{R}^d\) with \(\|v_i\|_2 \le 1\), the algorithm assigns each arriving vector \(v_t\) a random sign \(\varepsilon_t\) such that every prefix sum \(\sum_{i=1}^{t} \varepsilon_i v_i\) can be written as the sum of three coupled standard Gaussian vectors. Our algorithm runs in \(O(dT)\) time and achieves the optimal prefix discrepancy bound \(\max_{1 \le t \le T} \left\| \sum_{i=1}^{t} \varepsilon_i v_i \right\|_\infty = O\left(\sqrt{\log T}\right)\) with high probability. This recovers the optimal bound of Kulkarni, Reis, and Rothvoss, whose algorithm runs in time exponential in \(T\) and \(d\). The algorithm and main proof were discovered in a GPT-5.5 Pro Extended conversation prompted by the author.
ICML

Subliminal Effects in Your Data: A General Mechanism via Log-Linearity

Ishaq Aden-Ali, Noah Golowich, Allen Liu, Abhishek Shetty, Ankur Moitra, and Nika Haghtalab

International Conference on Machine Learning, 2026

Training modern large language models (LLMs) has become a veritable smorgasbord of algorithms and datasets designed to elicit particular behaviors, making it critical to develop techniques to understand the effects of datasets on the model’s properties. This is exacerbated by recent experiments that show datasets can transmit signals that are not directly observable from individual datapoints, posing a conceptual challenge for dataset-centric understandings of LLM training and suggesting a missing fundamental account of such phenomena. Towards understanding such effects, inspired by recent work on the linear structure of LLMs, we uncover a general mechanism through which hidden subtexts can arise in generic datasets. We introduce Logit-Linear-Selection (LLS), a method that prescribes how to select subsets of a generic preference dataset to elicit a wide range of hidden effects. We apply LLS to discover subsets of real-world datasets so that models trained on them exhibit behaviors ranging from having specific preferences, to responding to prompts in a different language not present in the dataset, to taking on a different persona. Crucially, the effect persists for the selected subset, across models with varying architectures, supporting its generality and universality.

2025

Preprint

Quantization for Vector Search under Streaming Updates

Ishaq Aden-Ali, Hakan Ferhatosmanoglu, Alexander Greaves-Tunnell, Nina Mishra, and Tal Wagner

Preprint, 2025

Large-scale vector databases for approximate nearest neighbor (ANN) search typically store a quantized dataset in main memory for fast access, and full precision data on remote disk. State-of-the-art ANN quantization methods are highly data-dependent, rendering them unable to handle point insertions and deletions. This either leads to degraded search quality over time, or forces costly global rebuilds of the entire search index. In this paper, we formally study data-dependent quantization under streaming dataset updates. We formulate a computation model of limited remote disk access and define a dynamic consistency property that guarantees freshness under updates. We use it to obtain the following results: Theoretically, we prove that static data-dependent quantization can be made dynamic with bounded disk I/O per update while retaining formal accuracy guarantees for ANN search. Algorithmically, we develop a practical data-dependent quantization method which is provably dynamically consistent, adapting itself to the dataset as it evolves over time. Our experiments show that the method outperforms baselines in large-scale nearest neighbor search quantization under streaming updates.
Preprint

On the Injective Norm of Sums of Random Tensors and the Moments of Gaussian Chaoses

Ishaq Aden-Ali

Preprint, 2025

We prove an upper bound on the expected \( \ell_p \) injective norm of sums of subgaussian random tensors. Our proof is simple and does not rely on any explicit geometric or chaining arguments. Instead, it follows from a simple application of the PAC-Bayesian lemma, a tool that has proven effective at controlling the suprema of certain “smooth” empirical processes in recent years. Our bound strictly improves a very recent result of Bandeira, Gopi, Jiang, Lucca, and Rothvoss. In the Euclidean case (\(p=2\)), our bound sharpens a result of Latała that was central to proving his estimates on the moments of Gaussian chaoses. As a consequence, we obtain an elementary proof of this fundamental result.

2024

COLT

Majority-of-Three: The Simplest Optimal Learner?

Ishaq Aden-Ali, Mikael Møller Høgsgaard, Kasper Green Larsen, and Nikita Zhivotovskiy

Conference on Learning Theory, 2024

Developing an optimal PAC learning algorithm in the realizable setting, where empirical risk minimization (ERM) is suboptimal, was a major open problem in learning theory for decades. The problem was finally resolved by Hanneke a few years ago. Unfortunately, Hanneke’s algorithm is quite complex as it returns the majority vote of many ERM classifiers that are trained on carefully selected subsets of the data. It is thus a natural goal to determine the simplest algorithm that is optimal. In this work we study the arguably simplest algorithm that could be optimal: returning the majority vote of three ERM classifiers. We show that this algorithm achieves the optimal in-expectation bound on its error which is provably unattainable by a single ERM classifier. Furthermore, we prove a near-optimal high-probability bound on this algorithm’s error. We conjecture that a better analysis will prove that this algorithm is in fact optimal in the high-probability regime.
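The majority vote of three ERM classifiers described above can be sketched for one-dimensional threshold classifiers. This is only an illustrative sketch: the threshold class, the choice to split the sample into three disjoint parts, and the names `erm_threshold` and `majority_of_three` are assumptions made for this example, not details taken from the abstract.

```python
def erm_threshold(sample):
    """ERM for thresholds h_t(x) = 1 if x >= t, on a realizable sample.

    Returns the smallest positively labeled point, which is consistent
    with every sample labeled by some threshold. (Hypothetical helper.)
    """
    positives = [x for x, y in sample if y == 1]
    return min(positives) if positives else float("inf")


def majority_of_three(sample):
    """Train one ERM on each of three disjoint parts and take a majority vote."""
    # Assumption: the three training subsets are disjoint thirds of the sample.
    parts = [sample[i::3] for i in range(3)]
    thresholds = [erm_threshold(part) for part in parts]

    def predict(x):
        votes = sum(1 for t in thresholds if x >= t)
        return 1 if votes >= 2 else 0

    return predict


# Small realizable sample: labels come from a threshold between 0.3 and 0.6.
sample = [
    (0.1, 0), (0.2, 0), (0.3, 0),
    (0.6, 1), (0.7, 1), (0.8, 1),
    (0.9, 1), (0.65, 1), (0.75, 1),
]
predict = majority_of_three(sample)
```

For thresholds, the majority vote amounts to using the median of the three learned thresholds, so a single unlucky part cannot decide the prediction on its own.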
RANDOM

On The Amortized Complexity of Approximate Counting

Ishaq Aden-Ali, Yanjun Han, Jelani Nelson, and Huacheng Yu

International Conference on Randomization and Computation, 2024

Naively storing a counter up to value \(n\) would require \(\Omega(\log n)\) bits of memory. Nelson and Yu [NY22], following work of [Morris78], showed that if the query answers need only be \((1+\epsilon)\)-approximate with probability at least \(1-\delta\), then \(O(\log\log n + \log\log(1/\delta) + \log(1/\epsilon))\) bits suffice, and in fact this bound is tight. Morris’ original motivation for studying this problem though, as well as modern applications, require not only maintaining one counter, but rather \(k\) counters for \(k\) large. This motivates the following question: for \(k\) large, can \(k\) counters be simultaneously maintained using asymptotically less memory than \(k\) times the cost of an individual counter? That is to say, does this problem benefit from an improved amortized space complexity bound? We answer this question in the negative. Specifically, we prove a lower bound for nearly the full range of parameters showing that, in terms of memory usage, there is no asymptotic benefit possible via amortization when storing multiple counters. Our main proof utilizes a certain notion of "information cost" recently introduced by Braverman, Garg and Woodruff in FOCS 2020 to prove lower bounds for streaming algorithms.
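For readers unfamiliar with approximate counting, the textbook base-2 Morris counter gives a feel for why far fewer bits than a full counter can suffice: it stores only an exponent. The sketch below is the classic illustration, not the construction or lower-bound argument of the paper, and the class name `MorrisCounter` is an assumption made for this example.

```python
import random


class MorrisCounter:
    """Textbook base-2 Morris counter: keeps an exponent x instead of the count."""

    def __init__(self, rng=None):
        self.x = 0
        self.rng = rng if rng is not None else random.Random()

    def increment(self):
        # Bump the exponent with probability 2^(-x).
        if self.rng.random() < 2.0 ** (-self.x):
            self.x += 1

    def estimate(self):
        # 2^x - 1 is an unbiased estimate of the number of increments.
        return 2 ** self.x - 1


counter = MorrisCounter(random.Random(0))
print(counter.estimate())  # 0 for a fresh counter
counter.increment()        # the first increment always succeeds (probability 1)
print(counter.estimate())  # 1
```

A single counter of this kind has high variance; averaging many independent copies, or changing the base, are the usual ways to tighten the estimate.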

2023

FOCS

Optimal PAC Bounds Without Uniform Convergence

Ishaq Aden-Ali, Yeshwanth Cherapanamjeri, Abhishek Shetty, and Nikita Zhivotovskiy

IEEE Symposium on Foundations of Computer Science, 2023

In statistical learning theory, determining the sample complexity of realizable binary classification for VC classes was a long-standing open problem. The results of Simon and Hanneke established sharp upper bounds in this setting. However, the reliance of their argument on the uniform convergence principle limits its applicability to more general learning settings such as multiclass classification. In this paper, we address this issue by providing optimal high probability risk bounds through a framework that surpasses the limitations of uniform convergence arguments. Our framework converts the leave-one-out error of permutation invariant predictors into high probability risk bounds. As an application, by adapting the one-inclusion graph algorithm of Haussler, Littlestone, and Warmuth, we propose an algorithm that achieves an optimal PAC bound for binary classification. Specifically, our result shows that certain aggregations of one-inclusion graph algorithms are optimal, addressing a variant of a classic question posed by Warmuth. We further instantiate our framework in three settings where uniform convergence is provably suboptimal. For multiclass classification, we prove an optimal risk bound that scales with the one-inclusion hypergraph density of the class, addressing the suboptimality of the analysis of Daniely and Shalev-Shwartz. For partial hypothesis classification, we determine the optimal sample complexity bound, resolving a question posed by Alon, Hanneke, Holzman, and Moran. For realizable bounded regression with absolute loss, we derive an optimal risk bound that relies on a modified version of the scale-sensitive dimension, refining the results of Bartlett and Long. Our rates surpass standard uniform convergence-based results due to the smaller complexity measure in our risk bound.
COLT

The One-Inclusion Graph Algorithm is not Always Optimal

Ishaq Aden-Ali, Yeshwanth Cherapanamjeri, Abhishek Shetty, and Nikita Zhivotovskiy

Conference on Learning Theory, 2023

The one-inclusion graph algorithm of Haussler, Littlestone, and Warmuth achieves an optimal in-expectation risk bound in the standard PAC classification setup. In one of the first COLT open problems, Warmuth conjectured that this prediction strategy always implies an optimal high probability bound on the risk, and hence is also an optimal PAC algorithm. We refute this conjecture in the strongest sense: for any practically interesting Vapnik-Chervonenkis class, we provide an in-expectation optimal one-inclusion graph algorithm whose high probability risk bound cannot go beyond that implied by Markov’s inequality. Our construction of these poorly performing one-inclusion graph algorithms uses Varshamov-Tenengolts error correcting codes. Our negative result has several implications. First, it shows that the same poor high-probability performance is inherited by several recent prediction strategies based on generalizations of the one-inclusion graph algorithm. Second, our analysis shows yet another statistical problem that enjoys an estimator that is provably optimal in expectation via a leave-one-out argument, but fails in the high-probability regime. This discrepancy occurs despite the boundedness of the binary loss for which arguments based on concentration inequalities often provide sharp high probability risk bounds.

2021

NeurIPS

Privately Learning Mixtures of Axis-Aligned Gaussians

Ishaq Aden-Ali, Hassan Ashtiani, and Chris Liaw

Conference on Neural Information Processing Systems, 2021

arXiv
ALT

On the Sample Complexity of Privately Learning Unbounded High-Dimensional Gaussians

Ishaq Aden-Ali, Hassan Ashtiani, and Gautam Kamath

International Conference on Algorithmic Learning Theory, 2021

We provide sample complexity upper bounds for agnostically learning multivariate Gaussians under the constraint of approximate differential privacy. These are the first finite sample upper bounds for general Gaussians which do not impose restrictions on the parameters of the distribution. Our bounds are near-optimal in the case when the covariance is known to be the identity, and conjectured to be near-optimal in the general case. From a technical standpoint, we provide analytic tools for arguing the existence of global “locally small” covers from local covers of the space. These are exploited using modifications of recent techniques for differentially private hypothesis selection. Our techniques may prove useful for privately learning other distribution classes which do not possess a finite cover.

2020

AISTATS

On the Sample Complexity of Learning Sum-Product Networks

Ishaq Aden-Ali and Hassan Ashtiani

International Conference on Artificial Intelligence and Statistics, 2020

Sum-Product Networks (SPNs) can be regarded as a form of deep graphical models that compactly represent deeply factored and mixed distributions. An SPN is a rooted directed acyclic graph (DAG) consisting of a set of leaves (corresponding to base distributions), a set of sum nodes (which represent mixtures of their children distributions) and a set of product nodes (representing the products of their children distributions). In this work, we initiate the study of the sample complexity of PAC-learning the set of distributions that correspond to SPNs. We show that the sample complexity of learning tree structured SPNs with the usual type of leaves (i.e., Gaussian or discrete) grows at most linearly (up to logarithmic factors) with the number of parameters of the SPN. More specifically, we show that the class of distributions that corresponds to tree structured Gaussian SPNs with \(k\) mixing weights and \(e\) (\(d\)-dimensional Gaussian) leaves can be learned within Total Variation error \(\varepsilon\) using at most \(\widetilde{O}((ed^2+k)/\varepsilon^2)\) samples. A similar result holds for tree structured SPNs with discrete leaves. We obtain the upper bounds based on the recently proposed notion of distribution compression schemes. More specifically, we show that if a (base) class of distributions \(\mathcal{F}\) admits an "efficient" compression, then the class of tree structured SPNs with leaves from \(\mathcal{F}\) also admits an efficient compression.

2018

Time-resolved diffuse optical tomography system using an accelerated inverse problem solver*

Mrwan Alayed, Mohamed Naser, Ishaq Aden-Ali, and M. Jamal Deen

Optics Express, 2018

HTML

2017

SISPAD

Novel experimentally calibrated multiphase TCAD model for cobalt germanide growth*

Mohamed Rabie, Ishaq Aden-Ali, and Yaser Haddara

International Conference on Simulation of Semiconductor Processes and Devices, 2017

HTML