Howdy! Welcome to my home page.
I am a second-year M.S. student at Xidian University, advised by Prof. Bo Chen. Concurrently, I serve as a Research Intern at Stony Brook University, working with Prof. Chenyu You. Prior to my graduate studies, I received my B.S. degree from Xidian University in 2023.
🔥 I am actively seeking a PhD position in the US starting in Fall 2026.
Please feel free to reach out to me via email if you believe I am a good fit for your research team.
I welcome the opportunity for further discussion! Please see my
CV for more details.
🧐 Research Interests
My primary research goal is to develop scalable, reliable, and efficient methods for machine learning and generative AI, with a main focus on:
Bayesian methods for disentangled representations and uncertainty estimation
Alignment and safety of foundation models, including LLMs, VLMs, and diffusion models
In addition, I am also highly interested in:
📚 Memorization in large models
🔄 Self-consuming/self-improving loops
🤖 Agent learning with foundation models
If you share the same research interests, feel free to reach out or add me on
WeChat
🚀🚀 News
[03/2025] Code for our paper CSR has been released,
and we were invited to publish the model on Hugging Face! ⚙️⚙️
[02/2025] One paper was accepted by CVPR 2025! 🎉🎉
[07/2024] Our paper HICE-Score was accepted by ACM MM 2024! 🎉🎉
Many large-scale systems rely on high-quality deep representations (embeddings) to facilitate tasks like retrieval,
search, and generative modeling.
Matryoshka Representation Learning (MRL) recently emerged as a solution for adaptive embedding lengths, but it
requires full model retraining and suffers from noticeable performance degradations at short lengths.
In this paper, we show that sparse coding offers a compelling alternative for achieving adaptive representation
with minimal overhead and higher fidelity.
We propose Contrastive Sparse Representation (CSR), a method that sparsifies pre-trained embeddings into a
high-dimensional but selectively activated feature space.
By leveraging lightweight autoencoding and task-aware contrastive objectives, CSR preserves semantic
quality while allowing flexible, cost-effective inference at different sparsity levels.
Extensive experiments on image, text, and multimodal benchmarks demonstrate that CSR consistently
outperforms MRL in terms of both accuracy and retrieval speed, often by large margins, while also cutting
training time to a fraction of that required by MRL.
Our results establish sparse coding as a powerful paradigm for adaptive representation learning in real-world
applications where efficiency and fidelity are both paramount.
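To make the idea of a "high-dimensional but selectively activated" representation concrete, here is a minimal NumPy sketch of top-k sparsification applied to a dense embedding. It is an illustration only, not the released CSR code: the function name `topk_sparse_code`, the random encoder weights, and the dimensions are all assumptions, and the contrastive training objective is omitted.

```python
import numpy as np

def topk_sparse_code(embedding, encoder_weight, encoder_bias, k):
    """Project a dense embedding into a wider space and keep the k largest activations."""
    # Linear projection followed by ReLU (hypothetical, untrained encoder).
    activation = np.maximum(encoder_weight @ embedding + encoder_bias, 0.0)
    # Indices of the k largest activations (order among them is not needed).
    top_indices = np.argpartition(activation, -k)[-k:]
    sparse = np.zeros_like(activation)
    sparse[top_indices] = activation[top_indices]
    return sparse

rng = np.random.default_rng(0)
dense_dim, hidden_dim = 8, 32                      # illustrative sizes only
W = rng.normal(size=(hidden_dim, dense_dim))       # stand-in for learned weights
b = np.zeros(hidden_dim)
x = rng.normal(size=dense_dim)                     # stand-in for a pre-trained embedding

code = topk_sparse_code(x, W, b, k=4)
print(code.shape, np.count_nonzero(code))          # at most k entries are nonzero
```

Changing `k` at inference time trades accuracy for cost without touching the encoder, which is the kind of flexibility the abstract describes.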
Factor analysis, often regarded as a Bayesian variant of matrix factorization,
offers superior capabilities in capturing uncertainty, modeling complex dependencies,
and ensuring robustness. As the deep learning era arrives, factor analysis is receiving
less and less attention due to its limited expressive ability. In contrast,
contrastive learning has emerged as a potent technique with demonstrated efficacy
in unsupervised representational learning. While the two methods are different paradigms,
recent theoretical analysis has revealed the mathematical equivalence between contrastive
learning and matrix factorization, suggesting that factor analysis can be
combined with contrastive learning. Motivated by the interconnectedness of contrastive
learning, matrix factorization, and factor analysis, this paper introduces a novel
Contrastive Factor Analysis framework, aiming to leverage factor analysis's advantageous
properties within the realm of contrastive learning. To further leverage the interpretability
properties of non-negative factor analysis, which can learn disentangled representations,
contrastive factor analysis is extended to a non-negative version. Finally, extensive
experimental validation showcases the efficacy of the proposed contrastive (non-negative)
factor analysis methodology across multiple key properties, including expressiveness,
robustness, interpretability, and accurate uncertainty estimation.
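For readers unfamiliar with the contrastive side of this connection, the sketch below shows a generic InfoNCE-style contrastive loss in NumPy. It is not the Contrastive Factor Analysis objective from the paper; the function name `info_nce_loss`, the temperature value, and the random data are assumptions chosen for illustration.

```python
import numpy as np

def info_nce_loss(view_a, view_b, temperature=0.5):
    """Generic InfoNCE loss: row i of view_a should match row i of view_b."""
    # L2-normalize so that dot products are cosine similarities.
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    # Subtract the row-wise max for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal of the similarity matrix.
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
features = rng.normal(size=(16, 8))                # hypothetical batch of embeddings
loss_aligned = info_nce_loss(features, features)   # positives are identical views
loss_random = info_nce_loss(features, rng.normal(size=(16, 8)))
print(loss_aligned, loss_random)                   # aligned views give the lower loss
```

The similarity matrix `a @ b.T` is the object that the theoretical analyses mentioned above relate to matrix factorization.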
Image captioning evaluation metrics fall into two categories: reference-based metrics and reference-free metrics. Reference-based approaches may struggle to evaluate descriptive captions with abundant visual details produced by advanced multimodal large language models, due to their heavy reliance on limited human-annotated references. In contrast, previous reference-free metrics have proven effective by using CLIP cross-modality similarity. Nonetheless, CLIP-based metrics, constrained by their focus on global image-text compatibility, often fail to detect local textual hallucinations and are insensitive to small visual objects. Moreover, their single-scale designs cannot provide an interpretable evaluation process, such as pinpointing the position of caption mistakes and identifying visual regions that have not been described. To address these limitations, we propose a novel reference-free metric for image captioning evaluation, dubbed Hierarchical Image Captioning Evaluation Score (HICE-S). By detecting local visual regions and textual phrases, HICE-S builds an interpretable hierarchical scoring mechanism, overcoming the single-scale structure of existing reference-free metrics. Comprehensive experiments indicate that our proposed metric achieves state-of-the-art performance on several benchmarks, outperforming existing reference-free metrics such as CLIP-S and PAC-S, and reference-based metrics such as METEOR and CIDEr. Moreover, several case studies reveal that the assessment process of HICE-S on detailed captions closely resembles interpretable human judgments.
Our code is available at https://github.com/joeyz0z/HICE.
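As background on the global, single-scale scoring that the abstract contrasts with HICE-S, the sketch below computes a CLIP-S-style score from one image embedding and one caption embedding. It assumes the commonly used form of CLIP-S, a clipped cosine similarity scaled by 2.5, and uses placeholder vectors instead of a real CLIP model. It does not implement HICE-S itself.

```python
import numpy as np

def clip_s(image_embedding, text_embedding, weight=2.5):
    """CLIP-S-style global score: weight * max(cosine similarity, 0)."""
    cosine = float(
        np.dot(image_embedding, text_embedding)
        / (np.linalg.norm(image_embedding) * np.linalg.norm(text_embedding))
    )
    return weight * max(cosine, 0.0)

# Placeholder embeddings; a real pipeline would obtain these from CLIP encoders.
image_vec = np.array([0.2, 0.9, 0.1, 0.4])
caption_vec = np.array([0.1, 0.8, 0.3, 0.5])
print(clip_s(image_vec, caption_vec))
```

Because the whole image and the whole caption each collapse to a single vector, a score of this kind cannot say which phrase or region caused a low value.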
The gamma belief network (GBN), often regarded as a deep topic model, has demonstrated its potential for uncovering multi-layer interpretable latent representations in text data. Its notable capability to acquire interpretable latent factors is partially attributed to sparse and non-negative gamma-distributed latent variables. However, the existing GBN and its variations are constrained by the linear generative model, thereby limiting their expressiveness and applicability. To address this limitation, we introduce the generalized gamma belief network (Generalized GBN) in this paper, which extends the original linear generative model to a more expressive non-linear generative model. Since the parameters of the Generalized GBN no longer possess an analytic conditional posterior, we further propose an upward-downward Weibull inference network to approximate the posterior distribution of the latent variables. The parameters of both the generative model and the inference network are jointly trained within the variational inference framework. Finally, we conduct comprehensive experiments on both expressivity and disentangled representation learning tasks to evaluate the performance of the Generalized GBN against state-of-the-art Gaussian variational autoencoders serving as baselines.
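One practical property of the Weibull distribution, which may be part of why it suits an inference network like the one described above, is that it can be sampled through a simple inverse-CDF transform of uniform noise. The sketch below illustrates only that transform; the function name `sample_weibull` and the shape and scale values are assumptions, and it is not the upward-downward inference network from the paper.

```python
import math
import numpy as np

def sample_weibull(shape, scale, size, rng):
    """Draw Weibull(shape, scale) samples as scale * (-log(1 - u)) ** (1 / shape)."""
    u = rng.random(size)                      # uniform noise in [0, 1)
    return scale * (-np.log1p(-u)) ** (1.0 / shape)

rng = np.random.default_rng(0)
shape, scale = 2.0, 1.5                       # illustrative parameters only
samples = sample_weibull(shape, scale, size=200_000, rng=rng)

# Analytic mean of a Weibull distribution: scale * Gamma(1 + 1 / shape).
expected_mean = scale * math.gamma(1.0 + 1.0 / shape)
print(samples.mean(), expected_mean)
```

Since the samples are a deterministic function of `shape`, `scale`, and the noise `u`, the same expression written in an automatic differentiation framework would let gradients flow to the distribution parameters.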