Publications

A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch European Conference on Computer Vision 2022 (to appear). We address the problem of retrieving in-the-wild images with both a sketch and a text query. We present TASK-former (Text And SKetch transformer), an end-to-end trainable model for image retrieval using a text description and a sketch as input. We argue that both input modalities complement each other in a manner that cannot be achieved easily by either one alone. TASK-former follows the late-fusion dual-encoder approach, similar to CLIP, which allows efficient and scalable retrieval since the retrieval set can be indexed independently of the queries. We empirically demonstrate that using an input sketch (even a poorly drawn one) in addition to text considerably increases retrieval recall compared to traditional text-based image retrieval. To evaluate our approach, we collect 5,000 hand-drawn sketches for images in the test set of the COCO dataset. The collected sketches are available at https://janesjanes.github.io/tsbir/.
@inproceedings{sangkloy2022, title = {A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch}, author = {Sangkloy, Patsorn and Jitkrittum, Wittawat and Yang, Diyi and Hays, James}, booktitle = {European Conference on Computer Vision}, year = {2022}, wj_note = {(To appear)}, wj_code = {https://janesjanes.github.io/tsbir/} }
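The late-fusion dual-encoder idea above can be sketched in a few lines: image embeddings are indexed once, and the text and sketch embeddings are combined only on the query side, so the index never needs to be rebuilt per query. Everything below (the sum-based fusion rule, the random "encoders", all names) is an illustrative assumption, not TASK-former's actual architecture.

```python
import numpy as np

def l2_normalize(v, axis=-1):
    # Normalize embeddings so that dot products are cosine similarities.
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

def fuse_query(text_emb, sketch_emb):
    # Hypothetical late fusion: combine the two modality embeddings
    # after encoding (here, a simple sum) and re-normalize.
    return l2_normalize(text_emb + sketch_emb)

def retrieve(query_emb, image_index):
    # image_index: (N, d) matrix of pre-computed image embeddings.
    # Because fusion happens on the query side only, the index can be
    # built once, independently of any query.
    scores = image_index @ query_emb
    return np.argsort(-scores)  # image indices ranked by similarity

# Toy example with random stand-ins for the learned encoders.
rng = np.random.default_rng(0)
d = 8
images = l2_normalize(rng.normal(size=(5, d)))
text = l2_normalize(rng.normal(size=d))
# A sketch embedding close to image 3 pulls the fused query toward it.
sketch = l2_normalize(images[3] + 0.1 * rng.normal(size=d))
ranking = retrieve(fuse_query(text, sketch), images)
```

The key property this illustrates is scalability: `images` can be embedded and indexed offline, exactly as in CLIP-style retrieval.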

Discussion of Multiscale Fisher’s Independence Test for Multivariate Dependence Biometrika 2022. We discuss how MultiFIT, the Multiscale Fisher’s Independence Test for Multivariate Dependence proposed by Gorsky and Ma (2022), compares to existing linear-time kernel tests based on the Hilbert-Schmidt independence criterion (HSIC). We highlight the fact that the levels of the kernel tests at any finite sample size can be controlled exactly, as is the case with the level of MultiFIT. In our experiments, we observe some of the performance limitations of MultiFIT in terms of test power.
@article{schrab2022discussion, title = {Discussion of Multiscale Fisher's Independence Test for Multivariate Dependence}, author = {Schrab, Antonin and Jitkrittum, Wittawat and Szab{\'o}, Zolt{\'a}n and Sejdinovic, Dino and Gretton, Arthur}, journal = {Biometrika}, year = {2022}, wj_http = {https://arxiv.org/abs/2206.11142} }

ELM: Embedding and Logit Margins for Long-Tail Learning 2022. Long-tail learning is the problem of learning under skewed label distributions, which pose a challenge for standard learners. Several recent approaches for the problem have proposed enforcing a suitable margin in logit space. Such techniques are intuitive analogues of the guiding principle behind SVMs, and are equally applicable to linear models and neural models. However, when applied to neural models, such techniques do not explicitly control the geometry of the learned embeddings. This can be potentially suboptimal, since embeddings for tail classes may be diffuse, resulting in poor generalization for these classes. We present Embedding and Logit Margins (ELM), a unified approach to enforce margins in logit space and regularize the distribution of embeddings. This connects losses for long-tail learning to proposals in the literature on metric embedding and contrastive learning. We theoretically show that minimising the proposed ELM objective helps reduce the generalisation gap. The ELM method is shown to perform well empirically, and results in tighter tail class embeddings.
@misc{elm2022, doi = {10.48550/ARXIV.2204.13208}, url = {https://arxiv.org/abs/2204.13208}, author = {Jitkrittum, Wittawat and Menon, Aditya Krishna and Rawat, Ankit Singh and Kumar, Sanjiv}, keywords = {Machine Learning (cs.LG), Machine Learning (stat.ML), FOS: Computer and information sciences}, title = {ELM: Embedding and Logit Margins for Long-Tail Learning}, publisher = {arXiv}, year = {2022}, copyright = {Creative Commons Attribution 4.0 International}, wj_http = {https://arxiv.org/abs/2204.13208} }
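The logit-margin half of the idea can be illustrated with the well-known logit-adjustment trick: offsetting each class's logit by a multiple of its log prior enlarges the effective margin for rare classes. This is a minimal sketch of that general principle only; it is not the authors' exact ELM objective, which additionally regularizes the embedding geometry.

```python
import numpy as np

def margin_cross_entropy(logits, labels, class_priors, tau=1.0):
    """Cross-entropy with per-class additive logit margins.

    Offsetting logits by tau * log(prior) makes rare classes harder to
    predict during training, enforcing a larger margin for them.
    (Illustrative sketch; ELM also controls embedding geometry.)
    """
    adjusted = logits + tau * np.log(class_priors)  # broadcast over the batch
    # Numerically stable log-softmax.
    z = adjusted - adjusted.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Skewed priors: class 0 dominant, class 2 rare.
priors = np.array([0.90, 0.09, 0.01])
logits = np.array([[2.0, 1.0, 0.5],
                   [0.5, 2.0, 1.0]])
labels = np.array([0, 2])
loss = margin_cross_entropy(logits, labels, priors)
```

Setting `tau=0` recovers plain cross-entropy, so the margin strength is a single tunable knob.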

A Witness Two-Sample Test Proceedings of The 25th International Conference on Artificial Intelligence and Statistics 2022. The Maximum Mean Discrepancy (MMD) has been the state-of-the-art nonparametric test for tackling the two-sample problem. Its statistic is given by the difference in expectations of the witness function, a real-valued function defined as a weighted sum of kernel evaluations on a set of basis points. Typically, the kernel is optimized on a training set, and hypothesis testing is performed on a separate test set to avoid overfitting (i.e., to control the type-I error). That is, the test set is used to simultaneously estimate the expectations and define the basis points, while the training set only serves to select the kernel and is discarded. In this work, we propose to use the training data to also define the weights and the basis points for better data efficiency. We show that 1) the new test is consistent and has a well-controlled type-I error; 2) the optimal witness function is given by a precision-weighted mean in the reproducing kernel Hilbert space associated with the kernel; and 3) the test power of the proposed test is comparable to or exceeds that of the MMD and other modern tests, as verified empirically on challenging synthetic and real problems (e.g., Higgs data).
@inproceedings{pmlrv151kubler22a, title = {A Witness Two-Sample Test}, author = {K\"ubler, Jonas M. and Jitkrittum, Wittawat and Sch\"olkopf, Bernhard and Muandet, Krikamol}, booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics}, pages = {1403--1419}, year = {2022}, editor = {CampsValls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel}, volume = {151}, series = {Proceedings of Machine Learning Research}, month = {28--30 Mar}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v151/kubler22a/kubler22a.pdf}, url = {https://proceedings.mlr.press/v151/kubler22a.html}, wj_http = {https://arxiv.org/abs/2102.05573}, wj_talk = {https://slideslive.ch/38980440/awitnesstwosampletest?ref=speaker17291}, wj_code = {https://github.com/jmkuebler/witstest} }
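The witness function at the heart of the abstract is easy to write down: its value at a point is the mean kernel similarity to one sample minus the mean similarity to the other, and the MMD statistic is the difference of its means under the two populations. This sketch shows the classical (unweighted) witness; the paper's contribution is learning the weights and basis points on the training split, which is not reproduced here.

```python
import numpy as np

def gauss_kernel(a, b, bandwidth=1.0):
    # Gaussian RBF kernel matrix between the rows of a and b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def witness(t, x, y, bandwidth=1.0):
    """MMD witness evaluated at points t: mean kernel similarity to
    sample x minus mean similarity to sample y."""
    return gauss_kernel(t, x, bandwidth).mean(1) - gauss_kernel(t, y, bandwidth).mean(1)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 1))   # sample from P
y = rng.normal(2.0, 1.0, size=(200, 1))   # sample from Q
t = np.array([[0.0], [2.0]])
w = witness(t, x, y)
# Statistic in the style described above: difference in the mean witness
# value under the two samples (here reusing x and y for illustration).
stat = witness(x, x, y).mean() - witness(y, x, y).mean()
```

The witness is positive where P has more mass than Q and negative where Q dominates, which is what makes it interpretable.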

ABCDP: Approximate Bayesian Computation with Differential Privacy Entropy 2021. We developed a novel approximate Bayesian computation (ABC) framework, ABCDP, which produces differentially private (DP) and approximate posterior samples. Our framework takes advantage of the sparse vector technique (SVT), widely studied in the differential privacy literature. SVT incurs the privacy cost only when a condition (whether a quantity of interest is above/below a threshold) is met. If the condition is sparsely met during the repeated queries, SVT can drastically reduce the cumulative privacy loss, unlike the usual case where every query incurs the privacy loss. In ABC, the quantity of interest is the distance between observed and simulated data, and only when the distance is below a threshold can we take the corresponding prior sample as a posterior sample. Hence, applying SVT to ABC is an organic way to transform an ABC algorithm into a privacy-preserving variant with minimal modification, while yielding posterior samples with a high privacy level. We theoretically analyzed the interplay between the noise added for privacy and the accuracy of the posterior samples. We apply ABCDP to several data simulators and show the efficacy of the proposed framework.
@article{abcdp2021, author = {Park, Mijung and Vinaroz, Margarita and Jitkrittum, Wittawat}, title = {ABCDP: Approximate Bayesian Computation with Differential Privacy}, journal = {Entropy}, volume = {23}, year = {2021}, number = {8}, article-number = {961}, url = {https://www.mdpi.com/1099-4300/23/8/961}, issn = {1099-4300}, doi = {10.3390/e23080961}, wj_http = {https://www.mdpi.com/1099-4300/23/8/961}, wj_code = {https://github.com/ParkLabML/ABCDP} }
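The ABC-plus-SVT pattern described above can be sketched as an AboveThreshold-style screen: Laplace noise is added once to the acceptance threshold and again to each distance query, and a prior sample is accepted when its noisy distance falls below the noisy threshold. The noise scales follow the textbook AboveThreshold pattern; the paper's exact calibration and privacy accounting are more involved, and the function name is illustrative.

```python
import numpy as np

def abc_svt(distances, threshold, epsilon, rng):
    """Sparse-vector-style screening for ABC (illustrative sketch).

    A noisy copy of the threshold is compared against noisy distance
    queries; privacy cost is tied to the (sparse) acceptances rather
    than to every query. Noise scales follow the standard
    AboveThreshold pattern for privacy parameter epsilon.
    """
    noisy_thresh = threshold + rng.laplace(scale=2.0 / epsilon)
    accepted = []
    for i, d in enumerate(distances):
        if d + rng.laplace(scale=4.0 / epsilon) < noisy_thresh:
            accepted.append(i)  # keep the i-th prior sample
    return accepted

rng = np.random.default_rng(0)
# Distances between observed data and data simulated from prior draws.
distances = np.array([0.1, 5.0, 0.2, 7.0, 0.05, 6.5])
accepted = abc_svt(distances, threshold=1.0, epsilon=50.0, rng=rng)
```

With a large `epsilon` (weak privacy, little noise) the screen reduces to plain rejection ABC; decreasing `epsilon` trades posterior accuracy for privacy, which is the interplay the paper analyzes.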

Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces ICML 2021. Negative sampling schemes enable efficient training given a large number of classes, by offering a means to approximate a computationally expensive loss function that takes all labels into account. In this paper, we present a new connection between these schemes and loss modification techniques for countering label imbalance. We show that different negative sampling schemes implicitly trade off performance on dominant versus rare labels. Further, we provide a unified means to explicitly tackle both sampling bias, arising from working with a subset of all labels, and labeling bias, which is inherent to the data due to label imbalance. We empirically verify our findings on long-tail classification and retrieval benchmarks.
@inproceedings{rawat2021, title = {Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces}, author = {Rawat, Ankit Singh and Menon, Aditya Krishna and Jitkrittum, Wittawat and Jayasumana, Sadeep and Yu, Felix X. and Reddi, Sashank and Kumar, Sanjiv}, booktitle = {ICML}, year = {2021}, wj_http = {https://arxiv.org/abs/2105.05736} }
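The sampling-bias side of the story can be illustrated with the classical logQ correction for sampled softmax: subtracting the log sampling probability from each negative's logit removes the bias introduced by scoring only a subset of the classes. This is a standard technique used here for illustration; the paper's unified treatment, which also corrects labeling bias from class imbalance, is not reproduced.

```python
import numpy as np

def sampled_softmax_loss(true_logit, sampled_logits, sample_probs):
    """Sampled softmax with the standard logQ correction.

    Subtracting log q(j) from each sampled negative's logit corrects
    for the fact that frequently sampled classes would otherwise be
    over-penalized. (Illustrative sketch of the sampling-bias
    correction only; per-class offsets for labeling bias are omitted.)
    """
    corrected = sampled_logits - np.log(sample_probs)
    z = np.concatenate([[true_logit], corrected])  # positive first
    z = z - z.max()  # numerical stability
    return -(z[0] - np.log(np.exp(z).sum()))

# One positive plus two sampled negatives with known sampling probabilities.
loss = sampled_softmax_loss(2.0, np.array([1.0, 0.5]), np.array([0.5, 0.25]))
```

Note that a negative sampled with high probability (q = 0.5) receives a smaller boost than a rarely sampled one (q = 0.25), which is exactly how the correction rebalances the approximation.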

Kernel Distributionally Robust Optimization: Generalized Duality Theorem and Stochastic Approximation Proceedings of The 24th International Conference on Artificial Intelligence and Statistics 2021. We propose kernel distributionally robust optimization (Kernel DRO) using insights from the robust optimization theory and functional analysis. Our method uses reproducing kernel Hilbert spaces (RKHS) to construct a wide range of convex ambiguity sets, which can be generalized to sets based on integral probability metrics and finite-order moment bounds. This perspective unifies multiple existing robust and stochastic optimization methods. We prove a theorem that generalizes the classical duality in the mathematical problem of moments. Enabled by this theorem, we reformulate the maximization with respect to measures in DRO into the dual program that searches for RKHS functions. Using universal RKHSs, the theorem applies to a broad class of loss functions, lifting common limitations such as polynomial losses and knowledge of the Lipschitz constant. We then establish a connection between DRO and stochastic optimization with expectation constraints. Finally, we propose practical algorithms based on both batch convex solvers and stochastic functional gradient, which apply to general optimization and machine learning tasks.
@inproceedings{zhu2021kernel, title = {Kernel Distributionally Robust Optimization: Generalized Duality Theorem and Stochastic Approximation}, author = {Zhu, Jia-Jie and Jitkrittum, Wittawat and Diehl, Moritz and Sch{\"o}lkopf, Bernhard}, booktitle = {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics}, pages = {280--288}, year = {2021}, editor = {Banerjee, Arindam and Fukumizu, Kenji}, volume = {130}, series = {Proceedings of Machine Learning Research}, month = {13--15 Apr}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v130/zhu21a/zhu21a.pdf}, url = {http://proceedings.mlr.press/v130/zhu21a.html}, wj_http = {http://proceedings.mlr.press/v130/zhu21a.html}, wj_code = {https://github.com/jjzhu/kdro} }

More Powerful Selective Kernel Tests for Feature Selection Proceedings of The 23rd International Conference on Artificial Intelligence and Statistics 2020. Refining one’s hypotheses in light of data is a commonplace scientific practice; however, this approach introduces selection bias and can lead to specious statistical analysis. One approach to addressing this phenomenon is to condition on the selection procedure, i.e., how we have used the data to generate our hypotheses, which prevents the information from being used again after selection. Many selective inference (a.k.a. post-selection inference) algorithms typically take this approach but will “over-condition” for the sake of tractability. While this practice yields well-calibrated p-values, it can incur a major loss in power. In our work, we extend two recent proposals for selecting features using the Maximum Mean Discrepancy and the Hilbert-Schmidt Independence Criterion to condition on the minimal conditioning event. We show how recent advances in multiscale bootstrap make this possible, and demonstrate our proposal over a range of synthetic and real-world experiments. Our results show that our proposed test is indeed more powerful in most scenarios.
@inproceedings{pmlrv108lim20a, title = {More Powerful Selective Kernel Tests for Feature Selection}, author = {Lim, Jen Ning and Yamada, Makoto and Jitkrittum, Wittawat and Terada, Yoshikazu and Matsui, Shigeyuki and Shimodaira, Hidetoshi}, booktitle = {Proceedings of The 23rd International Conference on Artificial Intelligence and Statistics}, pages = {820--830}, year = {2020}, editor = {Chiappa, Silvia and Calandra, Roberto}, volume = {108}, series = {Proceedings of Machine Learning Research}, address = {Online}, month = {26--28 Aug}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v108/lim20a/lim20a.pdf}, url = {http://proceedings.mlr.press/v108/lim20a.html}, wj_http = {http://proceedings.mlr.press/v108/lim20a.html}, wj_code = {https://github.com/jenninglim/multiscalefeatures} }

Learning Kernel Tests Without Data Splitting NeurIPS 2020. Modern large-scale kernel-based tests such as maximum mean discrepancy (MMD) and kernelized Stein discrepancy (KSD) optimize kernel hyperparameters on a held-out sample via data splitting to obtain the most powerful test statistics. While data splitting results in a tractable null distribution, it suffers from a reduction in test power due to smaller test sample size. Inspired by the selective inference framework, we propose an approach that enables learning the hyperparameters and testing on the full sample without data splitting. Our approach can correctly calibrate the test in the presence of such dependency, and yield a test threshold in closed form. At the same significance level, our approach’s test power is empirically larger than that of the data-splitting approach, regardless of its split proportion.
@incollection{kbler2020learning, title = {Learning Kernel Tests Without Data Splitting}, booktitle = {NeurIPS}, author = {K\"{u}bler, Jonas M. and Jitkrittum, Wittawat and Sch\"{o}lkopf, Bernhard and Muandet, Krikamol}, year = {2020}, wj_http = {https://arxiv.org/abs/2006.02286}, wj_code = {https://github.com/MPIIS/testswosplitting}, wj_talk = {https://nips.cc/virtual/2020/protected/poster_44f683a84163b3523afe57c2e008bc8c.html} }

Worst-Case Risk Quantification under Distributional Ambiguity using Kernel Mean Embedding in Moment Problem 2020 59th IEEE Conference on Decision and Control (CDC) 2020. In order to anticipate rare and impactful events, we propose to quantify the worst-case risk under distributional ambiguity using a recent development in kernel methods, the kernel mean embedding. Specifically, we formulate the generalized moment problem whose ambiguity set (i.e., the moment constraint) is described by constraints in the associated reproducing kernel Hilbert space in a nonparametric manner. We then present the tractable approximation and its theoretical justification. As a concrete application, we numerically test the proposed method in characterizing the worst-case constraint violation probability in the context of a constrained stochastic control system.
@inproceedings{9303938, author = {Zhu, Jia-Jie and Jitkrittum, Wittawat and Diehl, Moritz and Schölkopf, Bernhard}, booktitle = {2020 59th IEEE Conference on Decision and Control (CDC)}, title = {Worst-Case Risk Quantification under Distributional Ambiguity using Kernel Mean Embedding in Moment Problem}, year = {2020}, pages = {3457--3463}, doi = {10.1109/CDC42340.2020.9303938}, wj_http = {https://arxiv.org/abs/2004.00166} }

Testing Goodness of Fit of Conditional Density Models with Kernels The Conference on Uncertainty in Artificial Intelligence (UAI) 2020
We propose two nonparametric statistical tests of goodness of fit for conditional distributions: given a conditional probability density function p(y|x) and a joint sample, decide whether the sample is drawn from p(y|x)r_x(x) for some density r_x. Our tests, formulated with a Stein operator, can be applied to any differentiable conditional density model, and require no knowledge of the normalizing constant. We show that 1) our tests are consistent against any fixed alternative conditional model; 2) the statistics can be estimated easily, requiring no density estimation as an intermediate step; and 3) our second test offers an interpretable test result providing insight on where the conditional model does not fit well in the domain of the covariate. We demonstrate the interpretability of our test on a task of modeling the distribution of New York City’s taxi drop-off location given a pickup point. To our knowledge, our work is the first to propose such conditional goodness-of-fit tests that simultaneously have all these desirable properties.
@conference{jitkrittum2020testing, title = {Testing Goodness of Fit of Conditional Density Models with Kernels}, author = {Jitkrittum, Wittawat and Kanagawa, Heishiro and Sch\"{o}lkopf, Bernhard}, booktitle = {The Conference on Uncertainty in Artificial Intelligence (UAI)}, year = {2020}, url = {https://arxiv.org/abs/2002.10271}, wj_http = {https://arxiv.org/abs/2002.10271}, wj_code = {https://github.com/wittawatj/kernel-cgof}, wj_slides = {https://docs.google.com/presentation/d/14I4ndHux8C3ImRzAqmNPVLmLQkw8wZQucMMTZrACUXk/edit?usp=sharing}, wj_talk = {https://www.youtube.com/watch?v=RGz6LfZwqDA&list=PLTrdDEfEeShmhkbbCtmaPst7f7CFll0kc&index=95} }

Kernel Conditional Moment Test via Maximum Moment Restriction The Conference on Uncertainty in Artificial Intelligence (UAI) 2020. We propose a new family of specification tests called kernel conditional moment (KCM) tests. Our tests are built on conditional moment embeddings (CMME), a novel representation of conditional moment restrictions in a reproducing kernel Hilbert space (RKHS). After transforming the conditional moment restrictions into a continuum of unconditional counterparts, the test statistic is defined as the maximum moment restriction within the unit ball of the RKHS. We show that the CMME fully characterizes the original conditional moment restrictions, leading to consistency in both hypothesis testing and parameter estimation. The proposed test also has an analytic expression that is easy to compute as well as closed-form asymptotic distributions. Our empirical studies show that the KCM test has a promising finite-sample performance compared to existing tests.
@conference{mmrtest2020, author = {{Muandet}, Krikamol and {Jitkrittum}, Wittawat and {K{\"u}bler}, Jonas}, booktitle = {The Conference on Uncertainty in Artificial Intelligence (UAI)}, title = {Kernel Conditional Moment Test via Maximum Moment Restriction}, year = {2020}, wj_http = {https://arxiv.org/abs/2002.09225}, wj_code = {https://github.com/krikamol/kcmtest}, wj_talk = {https://www.youtube.com/watch?v=XkkXCXbsRpY} }

Kernel Stein Tests for Multiple Model Comparison NeurIPS 2019
We address the problem of nonparametric multiple model comparison: given l candidate models, decide whether each candidate is as good as the best one(s) in the list (negative), or worse (positive). We propose two statistical tests, each controlling a different notion of decision errors. The first test, building on the post-selection inference framework, provably controls the fraction of best models that are wrongly declared worse (false positive rate). The second test is based on multiple testing correction, and controls the fraction of the models declared worse that are in fact as good as the best (false discovery rate). We prove that under some conditions the first test can yield a higher true positive rate than the second. Experimental results on toy and real (CelebA, Chicago Crime data) problems show that the two tests have high true positive rates with well-controlled error rates. By contrast, the naive approach of choosing the model with the lowest score without correction leads to a large number of false positives.
@incollection{NIPS2019_8496, title = {Kernel Stein Tests for Multiple Model Comparison}, author = {Lim, Jen Ning and Yamada, Makoto and Sch\"{o}lkopf, Bernhard and Jitkrittum, Wittawat}, booktitle = {NeurIPS}, editor = {Wallach, H. and Larochelle, H. and Beygelzimer, A. and d\textquotesingle Alch\'{e}-Buc, F. and Fox, E. and Garnett, R.}, pages = {2240--2250}, year = {2019}, publisher = {Curran Associates, Inc.}, url = {http://papers.nips.cc/paper/8496-kernel-stein-tests-for-multiple-model-comparison.pdf}, wj_http = {https://arxiv.org/abs/1910.12252}, wj_code = {https://github.com/jenninglim/model-comparison-test}, wj_poster = {https://github.com/jenninglim/model-comparison-test/raw/master/docs/poster.pdf} }

Fisher Efficient Inference of Intractable Models NeurIPS 2019
Maximum Likelihood Estimators (MLE) have many good properties. For example, the asymptotic variance of the MLE solution attains the asymptotic Cramér-Rao lower bound (efficiency bound), which is the minimum possible variance for an unbiased estimator. However, obtaining such an MLE solution requires calculating the likelihood function, which may not be tractable due to the normalization term of the density model. In this paper, we derive a Discriminative Likelihood Estimator (DLE) from the Kullback-Leibler divergence minimization criterion implemented via a density ratio estimation procedure and the Stein operator. We study the problem of model inference using DLE. We prove its consistency and show that the asymptotic variance of its solution can also attain the efficiency bound under mild regularity conditions. We also propose a dual formulation of DLE which can be easily optimized. Numerical studies validate our asymptotic theorems and we give an example where DLE successfully estimates an intractable model constructed using a pretrained deep neural network.
@incollection{NIPS2019_9083, title = {Fisher Efficient Inference of Intractable Models}, author = {Liu, Song and Kanamori, Takafumi and Jitkrittum, Wittawat and Chen, Yu}, booktitle = {NeurIPS}, editor = {Wallach, H. and Larochelle, H. and Beygelzimer, A. and d\textquotesingle Alch\'{e}-Buc, F. and Fox, E. and Garnett, R.}, pages = {8790--8800}, year = {2019}, publisher = {Curran Associates, Inc.}, url = {http://papers.nips.cc/paper/9083-fisher-efficient-inference-of-intractable-models.pdf}, wj_http = {https://arxiv.org/abs/1805.07454}, wj_code = {https://github.com/anewgithubname/SteinDensityRatioEstimation}, wj_slides = {https://github.com/lamfeeling/SteinDensityRatioEstimation/blob/master/slides.pdf} }

A Kernel Stein Test for Comparing Latent Variable Models ArXiv 2019. We propose a nonparametric, kernel-based test to assess the relative goodness of fit of latent variable models with intractable unnormalized densities. Our test generalises the kernel Stein discrepancy (KSD) tests of (Liu et al., 2016; Chwialkowski et al., 2016; Yang et al., 2018; Jitkrittum et al., 2018), which required exact access to unnormalized densities. Our new test relies on the simple idea of using an approximate observed-variable marginal in place of the exact, intractable one. As our main theoretical contribution, we prove that the new test, with a properly corrected threshold, has a well-controlled type-I error. In the case of models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative maximum mean discrepancy test (Bounliphone et al., 2015), which cannot exploit the latent structure.
@article{latent_stein_test2019, author = {{Kanagawa}, Heishiro and {Jitkrittum}, Wittawat and {Mackey}, Lester and {Fukumizu}, Kenji and {Gretton}, Arthur}, title = {A Kernel Stein Test for Comparing Latent Variable Models}, journal = {ArXiv}, keywords = {Statistics - Machine Learning, Computer Science - Machine Learning}, year = {2019}, month = jul, eid = {arXiv:1907.00586}, pages = {arXiv:1907.00586}, archiveprefix = {arXiv}, eprint = {1907.00586}, primaryclass = {stat.ML}, wj_http = {https://arxiv.org/abs/1907.00586} }

Generate Semantically Similar Images with Kernel Mean Matching Women in Computer Vision Workshop, CVPR 2019. Oral presentation; 6 out of 64 accepted papers.
@misc{cagan_wicv2019, author = {Jitkrittum, Wittawat and Sangkloy, Patsorn and Gondal, Muhammad Waleed and Raj, Amit and Hays, James and {Sch{\"o}lkopf}, Bernhard}, title = {Generate Semantically Similar Images with Kernel Mean Matching}, howpublished = {Women in Computer Vision Workshop, CVPR}, year = {2019}, note = {The first two authors contributed equally. }, wj_http = {/assets/papers/cagan_wicv2019.pdf}, wj_highlight = {Oral presentation. 6 out of 64 accepted papers.}, wj_code = {https://github.com/wittawatj/cadgan} }

Kernel Mean Matching for Content Addressability of GANs ICML 2019. Long oral presentation.
We propose a novel procedure which adds "content-addressability" to any given unconditional implicit model, e.g., a generative adversarial network (GAN). The procedure allows users to control the generative process by specifying a set (of arbitrary size) of desired examples based on which similar samples are generated from the model. The proposed approach, based on kernel mean matching, is applicable to any generative model which transforms latent vectors to samples, and does not require retraining of the model. Experiments on various high-dimensional image generation problems (CelebA-HQ, LSUN bedroom, bridge, tower) show that our approach is able to generate images which are consistent with the input set, while retaining the image quality of the original model. To our knowledge, this is the first work that attempts to construct, at test time, a content-addressable generative model from a trained marginal model.
@inproceedings{cagan_icml2019, author = {Jitkrittum, Wittawat and Sangkloy, Patsorn and Gondal, Muhammad Waleed and Raj, Amit and Hays, James and {Sch{\"o}lkopf}, Bernhard}, title = {Kernel Mean Matching for Content Addressability of {GANs}}, booktitle = {ICML}, year = {2019}, note = {The first two authors contributed equally.}, wj_http = {https://arxiv.org/abs/1905.05882}, wj_code = {https://github.com/wittawatj/cadgan}, wj_poster = {/assets/poster/cadgan_poster_icml2019.pdf}, wj_slides = {https://docs.google.com/presentation/d/1XdsSP7jji2QB_tZf8QAYCJhs6QzGDmR_BLmOHLFCMqY/edit?usp=sharing}, wj_talk = {https://slideslive.com/38917639/applicationscomputervision}, wj_highlight = {Long oral presentation} }
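The objective underlying kernel mean matching can be sketched as the squared MMD between a batch of generated samples and the user-provided target set; minimizing it over the latent inputs of a fixed generator steers the output toward the targets without retraining. This is a sketch of the general principle with toy 2-D "samples", not the paper's exact training recipe.

```python
import numpy as np

def gaussian_kernel(a, b, bw=1.0):
    # RBF kernel matrix between the rows of a and b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

def kernel_mean_matching_loss(generated, targets, bw=1.0):
    """Squared MMD (V-statistic) between generated samples and the
    target set: small when the two distributions of samples match."""
    kxx = gaussian_kernel(generated, generated, bw).mean()
    kyy = gaussian_kernel(targets, targets, bw).mean()
    kxy = gaussian_kernel(generated, targets, bw).mean()
    return kxx + kyy - 2 * kxy

rng = np.random.default_rng(0)
targets = rng.normal(0.0, 1.0, size=(50, 2))  # user-specified examples
close = rng.normal(0.0, 1.0, size=(50, 2))    # batch matching the targets
far = rng.normal(3.0, 1.0, size=(50, 2))      # batch far from the targets
loss_close = kernel_mean_matching_loss(close, targets)
loss_far = kernel_mean_matching_loss(far, targets)
```

In the actual setting, `generated` would be the output of a frozen GAN generator and the loss would be minimized over its latent vectors.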

Kernel-Guided Training of Implicit Generative Models with Stability Guarantees ArXiv 2019. Modern implicit generative models such as generative adversarial networks (GANs) are generally known to suffer from issues such as instability, uninterpretability, and difficulty in assessing their performance. If we see these implicit models as dynamical systems, some of these issues are caused by being unable to control their behavior in a meaningful way during the course of training. In this work, we propose a theoretically grounded method to guide the training trajectories of GANs by augmenting the GAN loss function with a kernel-based regularization term that controls local and global discrepancies between the model and true distributions. This control signal allows us to inject prior knowledge into the model. We provide theoretical guarantees on the stability of the resulting dynamical system and demonstrate different aspects of it via a wide range of experiments.
@article{witness_gan_rkhs2019, author = {{Mehrjou}, Arash and {Jitkrittum}, Wittawat and {Muandet}, Krikamol and {Sch{\"o}lkopf}, Bernhard}, title = {{Kernel-Guided Training of Implicit Generative Models with Stability Guarantees}}, journal = {ArXiv}, keywords = {Computer Science - Machine Learning, Statistics - Machine Learning}, year = {2019}, month = jan, eid = {arXiv:1901.09206}, eprint = {1901.09206}, primaryclass = {cs.LG}, wj_http = {https://arxiv.org/abs/1901.09206} }

Informative Features for Model Comparison NeurIPS 2018. A linear-time test of relative goodness of fit of two models on a dataset. The test can produce evidence indicating where one model is better than the other. Applicable to implicit models such as GANs.
Given two candidate models, and a set of target observations, we address the problem of measuring the relative goodness of fit of the two models. We propose two new statistical tests which are nonparametric, computationally efficient (runtime complexity is linear in the sample size), and interpretable. As a unique advantage, our tests can produce a set of examples (informative features) indicating the regions in the data domain where one model fits significantly better than the other. In a real-world problem of comparing GAN models, the test power of our new test matches that of the state-of-the-art test of relative goodness of fit, while being one order of magnitude faster.
@inproceedings{jitkrittum_kmod2018, title = {Informative Features for Model Comparison}, author = {Jitkrittum, Wittawat and Kanagawa, Heishiro and Sangkloy, Patsorn and Hays, James and Sch\"{o}lkopf, Bernhard and Gretton, Arthur}, booktitle = {NeurIPS}, year = {2018}, wj_http = {https://arxiv.org/abs/1810.11630}, wj_code = {https://github.com/wittawatj/kernel-mod}, wj_poster = {/assets/poster/kmod_nips2018_poster.pdf}, wj_img = {cover_nips2018.png}, wj_summary = {A linear-time test of relative goodness of fit of two models on a dataset. The test can produce evidence indicating where one model is better than the other. Applicable to implicit models such as GANs.} }
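The relative-fit quantity being tested can be sketched as a difference of discrepancies: compare each model's samples to the data with squared MMD, and look at the sign of the difference. This illustrates only the underlying comparison; the paper's tests use optimized features for linear runtime and calibrate a proper significance threshold, neither of which is reproduced here.

```python
import numpy as np

def mmd2(a, b, bw=1.0):
    # Squared MMD (V-statistic) between samples a and b with an RBF kernel.
    def k(u, v):
        d2 = ((u[:, None, :] - v[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bw ** 2))
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=(300, 1))     # target observations
model_p = rng.normal(1.5, 1.0, size=(300, 1))  # samples from a worse model
model_q = rng.normal(0.1, 1.0, size=(300, 1))  # samples from a better model
# Relative-fit statistic: positive when model_q is closer to the data.
rel_stat = mmd2(model_p, data) - mmd2(model_q, data)
```

A hypothesis test then asks whether `rel_stat` is significantly different from zero rather than merely positive, which is what guards against false positives.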

Large Sample Analysis of the Median Heuristic ArXiv 2018. In kernel methods, the median heuristic has been widely used as a way of setting the bandwidth of RBF kernels. While its empirical performance makes it a safe choice under many circumstances, there is little theoretical understanding of why this is the case. Our aim in this paper is to advance our understanding of the median heuristic by focusing on the setting of the kernel two-sample test. We collect new findings that may be of interest to both theoreticians and practitioners. In theory, we provide a convergence analysis that shows the asymptotic normality of the bandwidth chosen by the median heuristic in the setting of the kernel two-sample test. Systematic empirical investigations are also conducted in simple settings, comparing the performances based on the bandwidths chosen by the median heuristic and those by the maximization of test power.
@article{median_heu_2018, author = {{Garreau}, Damien and {Jitkrittum}, Wittawat and {Kanagawa}, Motonobu}, title = {Large Sample Analysis of the Median Heuristic}, journal = {ArXiv}, eprint = {1707.07269}, keywords = {Mathematics - Statistics Theory, Statistics - Machine Learning, 62E20, 62G30}, year = {2018}, wj_http = {https://arxiv.org/abs/1707.07269} }
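The heuristic under study is simple to state: set the RBF bandwidth to the median of the pairwise distances among the pooled samples. Conventions vary across the literature (some authors use the median of squared distances, or divide by 2 inside the exponent); the sketch below implements one common variant.

```python
import numpy as np

def median_heuristic_bandwidth(x):
    """Median heuristic: the median of all pairwise Euclidean
    distances between the rows of x (one common convention; some
    papers use squared distances instead)."""
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    iu = np.triu_indices(len(x), k=1)  # distinct pairs only, no diagonal
    return np.median(dists[iu])

# Pairwise distances are 1, 3, and 2, so the median is 2.
x = np.array([[0.0], [1.0], [3.0]])
bw = median_heuristic_bandwidth(x)
```

For a two-sample test, `x` would be the pooled sample from both groups; the paper analyzes the large-sample distribution of exactly this data-dependent quantity.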

A Linear-Time Kernel Goodness-of-Fit Test NeurIPS 2017. NeurIPS 2017 Best Paper; 3 out of 3,240 submissions. A linear-time test of goodness of fit of an unnormalized density function on a dataset. The test can produce evidence indicating where (in the data domain) the model does not fit well.
We propose a novel adaptive test of goodness-of-fit, with computational cost linear in the number of samples. We learn the test features that best indicate the differences between observed samples and a reference model, by minimizing the false negative rate. These features are constructed via Stein’s method, meaning that it is not necessary to compute the normalising constant of the model. We analyse the asymptotic Bahadur efficiency of the new test, and prove that under a mean-shift alternative, our test always has greater relative efficiency than a previous linear-time kernel test, regardless of the choice of parameters for that test. In experiments, the performance of our method exceeds that of the earlier linear-time test, and matches or exceeds the power of a quadratic-time kernel test. In high dimensions and where model structure may be exploited, our goodness-of-fit test performs far better than a quadratic-time two-sample test based on the Maximum Mean Discrepancy, with samples drawn from the model.
@inproceedings{jitkrittum_lineartime_2017, title = {A Linear-Time Kernel Goodness-of-Fit Test}, url = {http://arxiv.org/abs/1705.07673}, booktitle = {NeurIPS}, author = {Jitkrittum, Wittawat and Xu, Wenkai and Szabo, Zoltan and Fukumizu, Kenji and Gretton, Arthur}, year = {2017}, wj_img = {cover_nips2017.png}, wj_summary = {A linear-time test of goodness of fit of an unnormalized density function on a dataset. The test can produce evidence indicating where (in the data domain) the model does not fit well.}, wj_http = {http://arxiv.org/abs/1705.07673}, wj_code = {https://github.com/wittawatj/kernel-gof}, wj_poster = {/assets/poster/kgof_nips2017_poster.pdf}, wj_slides = {/assets/talks/kgof_nips2017_oral.pdf}, wj_talk = {https://www.facebook.com/nipsfoundation/videos/1553635538061013/}, wj_highlight = {NeurIPS 2017 Best Paper. 3 out of 3,240 submissions.} }
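The Stein features mentioned in the abstract can be sketched in one dimension: at a test location v, the feature is score(x)·k(x, v) + d/dx k(x, v), which has zero mean under the model for any v (Stein's identity), and only the score function is needed, so no normalizing constant. The simplified statistic below averages squared mean-features over a few fixed locations; the actual test normalizes the features and optimizes the locations for power, which is omitted here.

```python
import numpy as np

def stein_feature(x, v, score_fn, bw=1.0):
    """Stein feature at test location v: score(x) * k(x, v) + dk/dx(x, v),
    with a 1-D Gaussian kernel k. Its mean is zero under the model."""
    k = np.exp(-(x - v) ** 2 / (2 * bw ** 2))
    dk = -(x - v) / bw ** 2 * k
    return score_fn(x) * k + dk

def fssd_like_stat(x, locs, score_fn, bw=1.0):
    # Average squared mean-feature across locations; computable in
    # time linear in the sample size.
    feats = np.stack([stein_feature(x, v, score_fn, bw) for v in locs])
    return (feats.mean(axis=1) ** 2).mean()

score = lambda x: -x  # score function of N(0, 1); no normalizer needed
rng = np.random.default_rng(0)
good = rng.normal(0.0, 1.0, size=2000)  # drawn from the model
bad = rng.normal(1.0, 1.0, size=2000)   # mean-shifted alternative
locs = np.array([-1.0, 0.0, 1.0])
stat_good = fssd_like_stat(good, locs, score)
stat_bad = fssd_like_stat(bad, locs, score)
```

Locations where the feature mean is far from zero are exactly the "evidence" the abstract refers to: regions of the data domain where the model misfits.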

Kernel-Based Distribution Features for Statistical Tests and Bayesian Inference 2017. My PhD thesis. Gatsby Unit, University College London.
@phdthesis{phdthesis2017, author = {Jitkrittum, Wittawat}, title = {Kernel-Based Distribution Features for Statistical Tests and {Bayesian} Inference}, school = {University College London}, year = {2017}, month = nov, url = {http://discovery.ucl.ac.uk/10037987/}, wj_http = {http://discovery.ucl.ac.uk/10037987/}, wj_highlight = {My PhD Thesis. Gatsby Unit, University College London.} }

An Adaptive Test of Independence with Analytic Kernel Embeddings ICML 2017
A new computationally efficient dependence measure, and an adaptive statistical test of independence, are proposed. The dependence measure is the difference between analytic embeddings of the joint distribution and the product of the marginals, evaluated at a finite set of locations (features). These features are chosen so as to maximize a lower bound on the test power, resulting in a test that is data-efficient, and that runs in linear time (with respect to the sample size n). The optimized features can be interpreted as evidence to reject the null hypothesis, indicating regions in the joint domain where the joint distribution and the product of the marginals differ most. Consistency of the independence test is established, for an appropriate choice of features. In real-world benchmarks, independence tests using the optimized features perform comparably to the state-of-the-art quadratic-time HSIC test, and outperform competing O(n) and O(n log n) tests.
@inproceedings{pmlrv70jitkrittum17a, title = {An Adaptive Test of Independence with Analytic Kernel Embeddings}, author = {Jitkrittum, Wittawat and Szab{\'o}, Zolt{\'a}n and Gretton, Arthur}, booktitle = {ICML}, year = {2017}, editor = {Precup, Doina and Teh, Yee Whye}, volume = {70}, url = {http://proceedings.mlr.press/v70/jitkrittum17a.html}, wj_http = {http://proceedings.mlr.press/v70/jitkrittum17a.html}, wj_code = {https://github.com/wittawatj/fsictest}, wj_slides = {/assets/talks/fsic_icml2017_oral.pdf}, wj_talk = {https://vimeo.com/255244123}, wj_poster = {/assets/poster/fsic_icml2017_poster.pdf} }
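The dependence measure evaluates the witness u(v, w) = E[k(x, v) l(y, w)] − E[k(x, v)] E[l(y, w)] at a few location pairs; it vanishes under independence. A rough NumPy sketch of this unnormalized statistic follows (the paper's NFSIC additionally normalizes by a covariance term and optimizes the locations; the Gaussian kernels, bandwidth, locations, and toy data here are assumptions):

```python
import numpy as np

def gauss_kernel(X, V, sigma2=1.0):
    # Gaussian kernel evaluated between n samples and J locations: (n, J)
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma2))

def fsic2(X, Y, V, W):
    # Witness u(v, w) = E[k(x,v) l(y,w)] - E[k(x,v)] E[l(y,w)],
    # squared and averaged over the J location pairs (v_j, w_j).
    K = gauss_kernel(X, V)           # (n, J)
    L = gauss_kernel(Y, W)           # (n, J)
    joint = (K * L).mean(axis=0)     # empirical E[k l]
    prod = K.mean(axis=0) * L.mean(axis=0)
    u = joint - prod
    return (u ** 2).mean()

rng = np.random.default_rng(0)
n = 3000
X_ind = rng.normal(size=(n, 1)); Y_ind = rng.normal(size=(n, 1))   # independent
X_dep = rng.normal(size=(n, 1))
Y_dep = X_dep + 0.3 * rng.normal(size=(n, 1))                      # dependent
V = rng.normal(size=(4, 1)); W = rng.normal(size=(4, 1))           # locations
assert fsic2(X_ind, Y_ind, V, W) < fsic2(X_dep, Y_dep, V, W)
```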

Cognitive Bias in Ambiguity Judgements: Using Computational Models to Dissect the Effects of Mild Mood Manipulation in Humans PLOS ONE 2016
@article{iigaya2016, title = {Cognitive Bias in Ambiguity Judgements: Using Computational Models to Dissect the Effects of Mild Mood Manipulation in Humans}, author = {Iigaya, Kiyohito and Jolivald, Aurelie and Jitkrittum, Wittawat and Gilchrist, Iain and Dayan, Peter and Paul, Elizabeth and Mendl, Michael}, year = {2016}, month = oct, journal = {PLOS ONE}, issn = {1932-6203}, publisher = {Public Library of Science}, wj_http = {http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0165840} }

Interpretable Distribution Features with Maximum Testing Power NeurIPS 2016. Oral presentation (1.8% of total submissions).
Two semimetrics on probability distributions are proposed, given as the sum of differences of expectations of analytic functions evaluated at spatial or frequency locations (i.e., features). The features are chosen so as to maximize the distinguishability of the distributions, by optimizing a lower bound on test power for a statistical test using these features. The result is a parsimonious and interpretable indication of how and where two distributions differ locally. An empirical estimate of the test power criterion converges with increasing sample size, ensuring the quality of the returned features. In real-world benchmarks on high-dimensional text and image data, linear-time tests using the proposed semimetrics achieve comparable performance to the state-of-the-art quadratic-time maximum mean discrepancy test, while returning human-interpretable features that explain the test results.
@inproceedings{fotest2016, author = {Jitkrittum, Wittawat and Szab\'{o}, Zolt\'{a}n and Chwialkowski, Kacper and Gretton, Arthur}, title = {Interpretable Distribution Features with Maximum Testing Power}, booktitle = {NeurIPS}, year = {2016}, url = {http://papers.nips.cc/paper/6148interpretabledistributionfeatureswithmaximumtestingpower}, wj_http = {http://papers.nips.cc/paper/6148interpretabledistributionfeatureswithmaximumtestingpower}, wj_poster = {/assets/poster/fotest_poster.pdf}, wj_code = {https://github.com/wittawatj/interpretabletest}, wj_talk = {https://channel9.msdn.com/Events/NeuralInformationProcessingSystemsConference/NeuralInformationProcessingSystemsConferenceNIPS2016/InterpretableDistributionFeatureswithMaximumTestingPower}, wj_slides = {/assets/talks/fotest_oral.pdf}, wj_highlight = {Oral presentation. 1.8\% of total submissions.} }
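The spatial-feature semimetric is the average squared difference of the two empirical mean embeddings evaluated at J locations. A toy NumPy sketch of this quantity (the paper's ME statistic additionally normalizes by a covariance and optimizes the locations and kernel; the kernel, bandwidth, and data below are illustrative assumptions):

```python
import numpy as np

def gauss_kernel(X, V, sigma2=1.0):
    # Gaussian kernel between n samples and J test locations: (n, J)
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma2))

def me_stat(X, Y, V):
    # Difference of the empirical mean embeddings of P and Q at the
    # J locations, squared and averaged: large where P and Q differ.
    diff = gauss_kernel(X, V).mean(axis=0) - gauss_kernel(Y, V).mean(axis=0)
    return (diff ** 2).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
Y_same = rng.normal(size=(2000, 2))          # same distribution as X
Y_shift = rng.normal(size=(2000, 2)) + 1.0   # mean-shifted distribution
V = rng.normal(size=(5, 2))                  # arbitrary test locations
assert me_stat(X, Y_same, V) < me_stat(X, Y_shift, V)
```

Inspecting which locations v contribute most to the statistic is what yields the "where do the distributions differ" interpretation.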

K2-ABC: Approximate Bayesian Computation with Infinite Dimensional Summary Statistics via Kernel Embeddings AISTATS 2016. Oral presentation (6.5% of total submissions).
Complicated generative models often result in a situation where computing the likelihood of observed data is intractable, while simulating from the conditional density given a parameter value is relatively easy. Approximate Bayesian Computation (ABC) is a paradigm that enables simulation-based posterior inference in such cases by measuring the similarity between simulated and observed data in terms of a chosen set of summary statistics. However, there is no general rule to construct sufficient summary statistics for complex models. Insufficient summary statistics will "leak" information, which leads to ABC algorithms yielding samples from an incorrect (partial) posterior. In this paper, we propose a fully nonparametric ABC paradigm which circumvents the need for manually selecting summary statistics. Our approach, K2-ABC, uses maximum mean discrepancy (MMD) as a dissimilarity measure between the distributions over observed and simulated data. MMD is easily estimated as the squared difference between their empirical kernel embeddings. Experiments on a simulated scenario and a real-world biological problem illustrate the effectiveness of the proposed algorithm.
@inproceedings{part_k2abc_2015_arxiv, author = {Park, Mijung and Jitkrittum, Wittawat and Sejdinovic, Dino}, title = {{K2-ABC}: Approximate {B}ayesian Computation with Infinite Dimensional Summary Statistics via Kernel Embeddings}, booktitle = {AISTATS}, year = {2016}, url = {http://jmlr.org/proceedings/papers/v51/park16.html}, wj_http = {http://jmlr.org/proceedings/papers/v51/park16.html}, wj_poster = {/assets/poster/k2abc_AISTATS2016_poster.pdf}, wj_code = {https://github.com/wittawatj/k2abc}, wj_slides = {/assets/talks/k2abc_AISTATS2016.pdf}, wj_highlight = {Oral presentation. 6.5\% of total submissions.} }
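The key ingredient is the (biased) empirical MMD² between observed and simulated data, turned into a soft ABC weight exp(−MMD²/ε). A hedged NumPy sketch under assumed choices (Gaussian kernel, bandwidth, ε, and a toy Gaussian "simulator"; the paper's experiments and weighting scheme details differ):

```python
import numpy as np

def gauss_gram(X, Y, sigma2=1.0):
    # Gaussian Gram matrix between two samples
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma2))

def mmd2(X, Y, sigma2=1.0):
    # Biased estimate: squared distance between empirical kernel embeddings,
    # E[k(x,x')] - 2 E[k(x,y)] + E[k(y,y')]
    return (gauss_gram(X, X, sigma2).mean()
            - 2 * gauss_gram(X, Y, sigma2).mean()
            + gauss_gram(Y, Y, sigma2).mean())

rng = np.random.default_rng(0)
obs = rng.normal(size=(500, 1))                 # observed data
sim_good = rng.normal(size=(500, 1))            # simulated at a good parameter
sim_bad = rng.normal(loc=2.0, size=(500, 1))    # simulated at a bad parameter
# K2-ABC-style soft weight: closer simulations get larger posterior weight
eps = 0.1
w_good = np.exp(-mmd2(obs, sim_good) / eps)
w_bad = np.exp(-mmd2(obs, sim_bad) / eps)
assert w_good > w_bad
```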

Bayesian Manifold Learning: The Locally Linear Latent Variable Model NeurIPS 2015
We introduce the Locally Linear Latent Variable Model (LL-LVM), a probabilistic model for nonlinear manifold discovery that describes a joint distribution over observations, their manifold coordinates and locally linear maps conditioned on a set of neighbourhood relationships. The model allows straightforward variational optimisation of the posterior distribution on coordinates and locally linear maps from the latent space to the observation space given the data. Thus, the LL-LVM encapsulates the local-geometry preserving intuitions that underlie non-probabilistic methods such as locally linear embedding (LLE). Its probabilistic semantics make it easy to evaluate the quality of hypothesised neighbourhood relationships, select the intrinsic dimensionality of the manifold, construct out-of-sample extensions and to combine the manifold model with additional probabilistic models that capture the structure of coordinates within the manifold.
@inproceedings{Park2015, author = {Park, Mijung and Jitkrittum, Wittawat and Qamar, Ahmad and Szab\'{o}, Zolt\'{a}n and Buesing, Lars and Sahani, Maneesh}, title = {Bayesian Manifold Learning: The Locally Linear Latent Variable Model}, booktitle = {NeurIPS}, year = {2015}, url = {http://arxiv.org/abs/1410.6791}, wj_http = {http://arxiv.org/abs/1410.6791}, wj_code = {https://github.com/mijungi/lllvm}, wj_slides = {/assets/talks/csml_lllvm.pdf} }

Kernel-Based Just-In-Time Learning for Passing Expectation Propagation Messages UAI 2015
We propose an efficient nonparametric strategy for learning a message operator in expectation propagation (EP), which takes as input the set of incoming messages to a factor node, and produces an outgoing message as output. This learned operator replaces the multivariate integral required in classical EP, which may not have an analytic expression. We use kernel-based regression, which is trained on a set of probability distributions representing the incoming messages, and the associated outgoing messages. The kernel approach has two main advantages: first, it is fast, as it is implemented using a novel two-layer random feature representation of the input message distributions; second, it has principled uncertainty estimates, and can be cheaply updated online, meaning it can request and incorporate new training data when it encounters inputs on which it is uncertain. In experiments, our approach is able to solve learning problems where a single message operator is required for multiple, substantially different data sets (logistic regression for a variety of classification problems), where it is essential to accurately assess uncertainty and to efficiently and robustly update the message operator.
@inproceedings{jitkrittum_kernelbased_2015, title = {Kernel-Based Just-In-Time Learning for Passing Expectation Propagation Messages}, author = {Jitkrittum, Wittawat and Gretton, Arthur and Heess, Nicolas and Eslami, S. M. Ali and Lakshminarayanan, Balaji and Sejdinovic, Dino and Szab\'{o}, Zolt\'{a}n}, url = {http://arxiv.org/abs/1503.02551}, booktitle = {UAI}, year = {2015}, wj_http = {http://arxiv.org/abs/1503.02551}, wj_poster = {/assets/poster/kjit_uai2015_poster.pdf}, wj_code = {https://github.com/wittawatj/kernelep} }

Performance of synchrony and spectral-based features in early seizure detection: exploring feature combinations and effect of latency International Workshop on Seizure Prediction (IWSP) 2015: Epilepsy Mechanisms, Models, Prediction and Control 2015
@misc{adam+al:2015:iwsp7, author = {Adam, Vincent and SoldadoMagraner, Joana and Jitkrittum, Wittawat and Strathmann, Heiko and Lakshminarayanan, Balaji and Ialongo, Alessandro Davide and Bohner, Gergo and Huh, Ben Dongsung and Goetz, Lea and Dowling, Shaun and Serban, Iulian Vlad and Louis, Matthieu}, title = {Performance of synchrony and spectral-based features in early seizure detection: exploring feature combinations and effect of latency}, booktitle = {International Workshop on Seizure Prediction (IWSP) 2015: Epilepsy Mechanisms, Models, Prediction and Control}, year = {2015}, wj_http = {http://www.iwsp7.org/}, wj_code = {https://github.com/vincentadam87/gatsbyhackathonseizure} }

High-Dimensional Feature Selection by Feature-Wise Kernelized Lasso Neural Computation 2014. The goal of supervised feature selection is to find a subset of input features that are responsible for predicting output values. The least absolute shrinkage and selection operator (Lasso) allows computationally efficient feature selection based on linear dependency between input features and output values. In this letter, we consider a feature-wise kernelized Lasso for capturing nonlinear input-output dependency. We first show that with particular choices of kernel functions, non-redundant features with strong statistical dependence on output values can be found in terms of kernel-based independence measures such as the Hilbert-Schmidt independence criterion. We then show that the globally optimal solution can be efficiently computed; this makes the approach scalable to high-dimensional problems. The effectiveness of the proposed method is demonstrated through feature selection experiments for classification and regression with thousands of features.
@article{YamadaJSXS14, author = {Yamada, Makoto and Jitkrittum, Wittawat and Sigal, Leonid and Xing, Eric P. and Sugiyama, Masashi}, title = {High-Dimensional Feature Selection by Feature-Wise Kernelized Lasso}, journal = {Neural Computation}, volume = {26}, number = {1}, year = {2014}, pages = {185--207}, url = {http://www.mitpressjournals.org/doi/abs/10.1162/NECO_a_00537#.U9O7Idtsylg}, wj_http = {http://www.mitpressjournals.org/doi/abs/10.1162/NECO_a_00537#.U9O7Idtsylg}, wj_code = {http://www.makotoyamadaml.com/hsiclasso.html} }
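The building block of this approach is the empirical HSIC between a single feature and the output, computed from centred Gram matrices; HSIC Lasso then selects features via a non-negative Lasso over such feature-wise terms. A toy NumPy sketch of the HSIC estimator only (the Gaussian kernels, bandwidths, and synthetic data are illustrative assumptions, not the released implementation):

```python
import numpy as np

def gauss_gram(x, sigma2=1.0):
    # Gaussian Gram matrix of a 1-d sample
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma2))

def hsic(x, y):
    # Empirical HSIC: trace(K H L H) / (n - 1)^2 with centering H = I - 11^T/n.
    # Near zero when x and y are independent, larger under dependence.
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = gauss_gram(x), gauss_gram(y)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
n = 500
y = rng.normal(size=n)
relevant = y ** 2 + 0.1 * rng.normal(size=n)   # nonlinearly related feature
noise = rng.normal(size=n)                     # irrelevant feature
# HSIC detects the nonlinear dependence that a linear Lasso would miss.
assert hsic(relevant, y) > hsic(noise, y)
```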

Feature Selection via L1-Penalized Squared-Loss Mutual Information IEICE Transactions 2013. Feature selection is a technique to screen out less important features. Many existing supervised feature selection algorithms use redundancy and relevancy as the main criteria to select features. However, feature interaction, potentially a key characteristic in real-world problems, has not received much attention. As an attempt to take feature interaction into account, we propose L1-LSMI, an L1-regularization-based algorithm that maximizes a squared-loss variant of mutual information between selected features and outputs. Numerical results show that L1-LSMI performs well in handling redundancy, detecting nonlinear dependency, and considering feature interaction.
@article{Jitkrittum2013, author = {Jitkrittum, Wittawat and Hachiya, Hirotaka and Sugiyama, Masashi}, title = {Feature Selection via L1-Penalized Squared-Loss Mutual Information}, journal = {IEICE Transactions}, year = {2013}, volume = {96-D}, pages = {1513--1524}, number = {7}, wj_pdf = {http://wittawat.com/pages/files/L1LSMI.pdf}, wj_code = {https://github.com/wittawatj/l1lsmi} }

Squared-loss Mutual Information Regularization: A Novel Information-theoretic Approach to Semi-supervised Learning ICML 2013. We propose squared-loss mutual information regularization (SMIR) for multi-class probabilistic classification, following the information maximization principle. SMIR is convex under mild conditions and thus improves the non-convexity of mutual information regularization. It offers all of the following four abilities to semi-supervised algorithms: analytical solution, out-of-sample/multi-class classification, and probabilistic output. Furthermore, novel generalization error bounds are derived. Experiments show SMIR compares favorably with state-of-the-art methods.
@inproceedings{Niu2013, author = {Niu, Gang and Jitkrittum, Wittawat and Dai, Bo and Hachiya, Hirotaka and Sugiyama, Masashi}, title = {Squared-loss Mutual Information Regularization: A Novel Information-theoretic Approach to Semi-supervised Learning}, booktitle = {ICML}, year = {2013}, volume = {28}, pages = {10--18}, url = {http://jmlr.org/proceedings/papers/v28/niu13.pdf}, wj_pdf = {http://jmlr.org/proceedings/papers/v28/niu13.pdf}, wj_code = {https://github.com/wittawatj/smir} }

QAST: Question Answering System for Thai Wikipedia Proceedings of the 2009 Workshop on Knowledge and Reasoning for Answering Questions 2009. We propose an open-domain question answering system using Thai Wikipedia as the knowledge base. Two types of information are used for answering a question: (1) structured information extracted and stored in the form of Resource Description Framework (RDF), and (2) unstructured texts stored as a search index. For the structured information, a SPARQL-transformed query is applied to retrieve a short answer from the RDF base. For the unstructured information, a keyword-based query is used to retrieve the shortest text span containing the question's key terms. From the experimental results, the system which integrates both approaches could achieve an average MRR of 0.47 based on 215 test questions.
@inproceedings{Jitkrittum2009, author = {Jitkrittum, Wittawat and Haruechaiyasak, Choochart and Theeramunkong, Thanaruk}, title = {{QAST}: Question Answering System for {Thai} Wikipedia}, booktitle = {Proceedings of the 2009 Workshop on Knowledge and Reasoning for Answering Questions}, year = {2009}, series = {KRAQ '09}, pages = {11--14}, publisher = {Association for Computational Linguistics}, url = {http://dl.acm.org/citation.cfm?id=1697288.1697291}, wj_http = {http://dl.acm.org/citation.cfm?id=1697288.1697291} }

Implementing News Article Category Browsing Based on Text Categorization Technique Web Intelligence/IAT Workshops 2008. We propose a feature called category browsing to enhance the full-text search function of a Thai-language news article search engine. Category browsing allows users to browse and filter search results based on some predefined categories. To implement the category browsing feature, we applied and compared several text categorization algorithms including decision tree, Naive Bayes (NB), and Support Vector Machines (SVM). To further increase the performance of text categorization, we evaluated several feature selection techniques including document frequency thresholding (DF), information gain (IG), and χ² (CHI). Based on our experiments using a large news corpus, the SVM algorithm with the IG feature selection yielded the best performance with the F1 measure equal to 95.42%.
@inproceedings{Haruechaiyasak2008, author = {Haruechaiyasak, Choochart and Jitkrittum, Wittawat and Sangkeettrakarn, Chatchawal and Damrongrat, Chaianun}, title = {Implementing News Article Category Browsing Based on Text Categorization Technique}, booktitle = {Web Intelligence/IAT Workshops}, year = {2008}, pages = {143--146}, ee = {http://dx.doi.org/10.1109/WIIAT.2008.61}, wj_http = {http://dx.doi.org/10.1109/WIIAT.2008.61} }
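As an illustration of one of the feature selection criteria compared above, the χ² (CHI) score of a binary term indicator against a binary class label can be computed from the 2×2 contingency table of term presence versus class. A toy NumPy sketch on synthetic data (not the paper's corpus or implementation; the flip rate and sample size are arbitrary):

```python
import numpy as np

def chi2_term(term_present, labels):
    # Chi-square statistic between a binary term indicator and a binary
    # class label: sum over the 2x2 table of (observed - expected)^2 / expected.
    n = len(labels)
    obs = np.array([[np.sum((term_present == a) & (labels == b))
                     for b in (0, 1)] for a in (0, 1)], dtype=float)
    exp = obs.sum(1, keepdims=True) * obs.sum(0, keepdims=True) / n
    return ((obs - exp) ** 2 / exp).sum()

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
flip = rng.random(1000) < 0.1
informative = (labels ^ flip).astype(int)        # agrees with the class ~90%
random_term = rng.integers(0, 2, size=1000)      # unrelated to the class
# A class-predictive term scores far higher than an unrelated one,
# so ranking terms by this score keeps the discriminative vocabulary.
assert chi2_term(informative, labels) > chi2_term(random_term, labels)
```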

Proximity-Based Semantic Relatedness Measurement on Thai Wikipedia International Conference on Knowledge, Information and Creativity Support Systems (KICSS) 2008
@inproceedings{proximity2008, author = {Jitkrittum, Wittawat and Theeramunkong, Thanaruk and Haruechaiyasak, Choochart}, title = {Proximity-Based Semantic Relatedness Measurement on {Thai} {Wikipedia}}, booktitle = {International Conference on Knowledge, Information and Creativity Support Systems (KICSS)}, year = {2008} }

Managing Offline Educational Web Contents with Search Engine Tools International Conference on Asian Digital Libraries 2007. In this paper, we describe our ongoing project to help alleviate the digital divide problem among high schools in rural areas of Thailand. The idea is to select, organize, index and distribute useful educational Web contents to schools where the Internet connection is not available. These Web contents can be used by teachers and students to enhance the teaching and learning of many class subjects. We have collaborated with a group of teachers from different high schools in order to gather the requirements for designing our software tools. One of the challenging issues is the variation in computer hardware and network configuration found in different schools. Some schools have PCs connected to the school's server via the Local Area Network (LAN), while some other schools have low-performance PCs without any network connection. To support both cases, we provide two solutions via two different search engine tools. These tools support content administrators, e.g., teachers, with features to organize and index the contents. The tools also provide general users with features to browse and search for needed information. Since the contents and index are locally stored on a hard disk or some removable media such as a CD-ROM, the Internet connection is not needed.
@inproceedings{Haruechaiyasak2007, author = {Haruechaiyasak, Choochart and Sangkeettrakarn, Chatchawal and Jitkrittum, Wittawat}, title = {Managing Offline Educational Web Contents with Search Engine Tools}, booktitle = {International Conference on Asian Digital Libraries}, year = {2007}, pages = {444--453}, ee = {http://dx.doi.org/10.1007/9783540770947_56}, wj_http = {http://dx.doi.org/10.1007/9783540770947_56} }