Research

Our publications in tabular AI and synthetic data generation

2025 In Preparation

Synthetic Data for Low-Resource Swahili Sentiment Analysis: Multi-LLM Judging with Human Validation

Swahili, a vital African lingua franca, remains under-resourced in Natural Language Processing (NLP), limiting technological progress for its more than 100 million speakers. To help close this gap, we introduce a controllable synthetic data pipeline that generates culturally grounded Swahili text and evaluates it using three LLM judges across linguistic quality, cultural relevance, sentiment alignment, and instruction adherence. To validate the reliability of these LLM-based evaluations, a native Swahili speaker conducts a targeted human assessment on a representative subset of 200 samples, confirming the judges' accuracy and highlighting areas for calibration. The pipeline defines fine-grained generation criteria, produces diverse candidate outputs from multiple LLMs, and filters aggressively based on the judges' scores to retain only high-quality samples. Using the resulting corpus, we continue fine-tuning multilingual sentiment classifiers and observe consistent macro–F1 improvements on AfriSenti–Swahili over zero-shot baselines. These findings demonstrate that LLM-judged synthetic supervision, supported by human evaluation, can reliably transfer sentiment capability to a low-resource language. We release the dataset and models to support future work on Swahili and other African languages.
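To illustrate the filtering stage described above, here is a minimal sketch of multi-judge score aggregation; the judge names, 1–5 scoring scale, and threshold are illustrative placeholders, not the paper's actual configuration.

```python
# Sketch of multi-judge filtering: each candidate sample receives scores
# from several LLM judges across four criteria; a sample is retained only
# if its per-criterion average clears a quality threshold.

CRITERIA = ("linguistic_quality", "cultural_relevance",
            "sentiment_alignment", "instruction_adherence")

def aggregate_scores(judge_scores):
    """Average each criterion across judges. judge_scores maps
    judge name -> {criterion: score on a 1-5 scale}."""
    return {c: sum(s[c] for s in judge_scores.values()) / len(judge_scores)
            for c in CRITERIA}

def keep_sample(judge_scores, threshold=4.0):
    """Keep a sample only if every averaged criterion meets the threshold."""
    avg = aggregate_scores(judge_scores)
    return all(v >= threshold for v in avg.values())

# Hypothetical scores from three judges for one candidate sample.
scores = {
    "judge_a": {"linguistic_quality": 5, "cultural_relevance": 4,
                "sentiment_alignment": 5, "instruction_adherence": 4},
    "judge_b": {"linguistic_quality": 4, "cultural_relevance": 4,
                "sentiment_alignment": 5, "instruction_adherence": 5},
    "judge_c": {"linguistic_quality": 5, "cultural_relevance": 5,
                "sentiment_alignment": 4, "instruction_adherence": 4},
}
```

Raising the threshold makes the filter more aggressive, trading corpus size for quality, which is the knob the pipeline tunes when retaining only high-scoring samples.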

Authors: Samuel Gyamfi, Alfred Kondoro, Yankı Öztür, Richard H. Schreiber, Vadim Borisov
2025 Dagstuhl Seminar

Unlocking the Full Potential of Data Science Requires Tabular Foundation Models, Agents, and Humans

Despite its vast potential, data science remains constrained by manual workflows and fragmented tools. Meanwhile, foundation models have transformed natural language and computer vision — and are beginning to bring similar breakthroughs to structured data, particularly the ubiquitous tabular data central to data science. At the same time, there are strong claims that fully autonomous agentic data science systems will emerge. We argue that, rather than replacing data scientists, the future of data science lies in a new paradigm that amplifies their impact: collaborative systems that tightly integrate agents and tabular foundation models (TFMs) with human experts. In this paper, we discuss the potential and challenges of navigating the interplay among these three and present a research agenda to guide this disruption toward a more accessible, robust, and human-centered data science.

Authors: Tianji Cong, Julian Martin Eisenschlos, Daniel Gomm, Leo Grinsztajn, Andreas C Mueller, Anupam Sanghi, Jan-Micha Bodensohn, Vadim Borisov, Michael Cochez, Katharina Eggensperger, Floris Geerts, Myung Jun Kim, Andreas Kipf, Xue Li, Olga Ovcharenko, Paolo Papotti, Lennart Purucker, Sebastian Schelter, Immanuel Trummer, Gaël Varoquaux, Liane Vogel, Carsten Binnig, Madelon Hulsebos, Frank Hutter
Read Paper →
2024 ICML Workshop

Open Artificial Knowledge

The tremendous success of chat-based AI systems like ChatGPT, Claude, and Gemini stems from Large Language Models (LLMs) trained on vast amounts of data. However, acquiring high-quality, diverse, and ethically sourced training data remains a significant challenge. We introduce the Open Artificial Knowledge (OAK) dataset, a large-scale resource of over 500 million tokens (at the time of writing) designed to address this issue. OAK leverages an ensemble of state-of-the-art LLMs, including GPT4o, LLaMa3-70B, LLaMa3-8B, Mixtral-8x7B, Gemma-7B, and Gemma-2-9B, to generate high-quality text across diverse domains, guided by Wikipedia's main categories. Our methodology ensures broad knowledge coverage while maintaining coherence and factual accuracy. The OAK dataset aims to foster the development of more capable and aligned language models while addressing critical issues of data scarcity and privacy in LLM training, and it is freely available on www.oakdataset.org.
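The category-guided generation described above can be sketched as a prompt-construction loop; the category list, subtopics, and prompt template below are illustrative stand-ins, not OAK's actual prompts.

```python
# Sketch of category-guided prompt construction: generation is steered by
# high-level Wikipedia-style categories, with subtopics and a variation
# index to diversify outputs across the LLM ensemble.

import itertools

MAIN_CATEGORIES = ["Science", "History", "Technology", "Culture"]
SUBTOPICS = {
    "Science": ["physics", "biology"],
    "History": ["ancient history", "20th century"],
    "Technology": ["computing", "energy"],
    "Culture": ["literature", "music"],
}

def build_prompts(n_per_topic=2):
    """Return one generation prompt per (subtopic, variation) pair."""
    prompts = []
    for category in MAIN_CATEGORIES:
        for subtopic, i in itertools.product(SUBTOPICS[category],
                                             range(n_per_topic)):
            prompts.append(
                f"Write an informative, factually accurate article about "
                f"{subtopic} (category: {category}). Variation {i + 1}."
            )
    return prompts
```

Each prompt would then be sent to one of the ensemble models; spreading the same topic across several models and variations is one way to obtain the breadth and diversity the abstract describes.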

Authors: Vadim Borisov, Richard H. Schreiber
Read Paper →
2023 ICLR

Language Models are Realistic Tabular Data Generators

Tabular data is among the oldest and most ubiquitous forms of data. However, the generation of synthetic samples with the original data's characteristics remains a significant challenge for tabular data. While many generative models from the computer vision domain, such as variational autoencoders or generative adversarial networks, have been adapted for tabular data generation, less research has been directed towards recent transformer-based large language models (LLMs), which are also generative in nature. To this end, we propose GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive generative LLM to sample synthetic and yet highly realistic tabular data. Furthermore, GReaT can model tabular data distributions by conditioning on any subset of features; the remaining features are sampled without additional overhead. We demonstrate the effectiveness of the proposed approach in a series of experiments that quantify the validity and quality of the produced data samples from multiple angles. We find that GReaT maintains state-of-the-art performance across numerous real-world and synthetic datasets of various sizes with heterogeneous feature types.
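The core idea behind GReaT — serializing table rows into natural-language sentences so an autoregressive LLM can model and resample them — can be sketched as follows; the helper names, example row, and fixed random seed are our own choices for illustration, not the paper's implementation.

```python
# Sketch of GReaT-style row serialization: each row becomes a sentence of
# "feature is value" clauses in a random order, so the model is not tied to
# a fixed feature ordering. Conditioning on a feature subset amounts to
# placing the known clauses first and letting the LLM complete the rest.

import random

def encode_row(row, rng=None):
    """Serialize a dict row into a permuted 'feature is value' sentence."""
    items = list(row.items())
    (rng or random).shuffle(items)
    return ", ".join(f"{k} is {v}" for k, v in items)

def decode_row(text):
    """Recover the feature/value dict from an encoded sentence."""
    row = {}
    for clause in text.split(", "):
        key, _, value = clause.partition(" is ")
        row[key] = value
    return row

row = {"Age": "39", "Education": "Bachelors", "Income": "50K"}
encoded = encode_row(row, rng=random.Random(0))

# Conditional sampling prompt: fix the known feature(s) and let the LLM
# generate the remaining clauses, which are then decoded back into a row.
prompt = "Age is 39, "
```

Because encoding and decoding are simple and lossless, the same textual interface serves both training (rows to sentences) and sampling (generated sentences back to rows).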

Authors: Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, Gjergji Kasneci
Code & Paper →