Get in touch with us at behavior-in-the-wild@googlegroups.com
In high-stakes settings such as brand compliance, clinical care, and content moderation, machine learning models cannot be deployed as opaque oracles: practitioners must be able to inspect the features driving model decisions, and models must be able to leverage the expert documentation that already governs these domains. This requires features discovered from raw text and images to be interpretable, discriminative, and aligned with what experts consider important. Existing methods fall short: they target tabular inputs, lack demonstrated expert alignment, and cannot operationalize qualitative criteria such as “maintain professional tone” into precise features. To address these challenges, we present FEST (Feature Engineering with Self-evolving Trees), which combines dual-stream feature generation (semantic and deterministic), semantic deduplication, and tree-guided iterative evolution to discover features directly from unstructured data. FEST leads in 17 of 20 classifier-task combinations across brand classification (text and images), content authenticity detection, and stress detection, with a mean gain of 4.2 percentage points (pp) over the strongest baseline across five classifiers. An LLM-as-judge evaluation shows that FEST achieves 60–80% coverage of expert-designed brand features at strict semantic-alignment thresholds, a result corroborated by a human expert study in which FEST features were rated highly on relevance, clarity, and actionability. When seeded with expert guidelines, FEST refines qualitative criteria into precise, operational features, improving downstream accuracy by 6–12 pp on average across brands. To enable systematic evaluation of expert alignment in automated feature engineering, we release BrandGuide, the first dataset pairing expert-designed features with 1M+ assets across 2,683 brands. By grounding automated feature engineering in expert knowledge, FEST opens a practical pathway for deploying interpretable ML in domains that demand human oversight and accountability.
FEST is evaluated across brand classification (text and images), content authenticity detection, and stress detection using five classifiers: decision tree (DT), logistic regression (LR), random forest (RF), multilayer perceptron (MLP), and XGBoost (XGB). The accuracy values below (in %) are averaged across all five classifiers; per-classifier breakdowns are in Appendix E.
| Method | Brand Classification (Text) | Brand Classification (Images) | Content Authenticity | Stress Detection |
|---|---|---|---|---|
| Zero-Shot LLM | 75.6 | 70.6 | 79.8 | 73.3 |
| Few-Shot LLM | 77.8 | 74.7 | 73.9 | 72.8 |
| Felix | 78.1 | 69.7 | 87.5 | 79.1 |
| FEST (Ours) | 82.9 | 79.3 | 91.0 | 80.5 |
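To read the table as margins rather than raw accuracies, the short sketch below computes, for each task, FEST's lead over whichever baseline scores highest on that task. The numbers are copied from the table; the dictionary keys and the function name `margins_over_best_baseline` are illustrative choices, not part of the FEST release. Because it works only from classifier-averaged values, it need not reproduce aggregate gains that are computed per classifier.

```python
# Classifier-averaged accuracy (%) copied from the results table.
# Task keys are hypothetical short names chosen for this sketch.
ACCURACY = {
    "Zero-Shot LLM": {"brand_text": 75.6, "brand_images": 70.6,
                      "content_auth": 79.8, "stress": 73.3},
    "Few-Shot LLM": {"brand_text": 77.8, "brand_images": 74.7,
                     "content_auth": 73.9, "stress": 72.8},
    "Felix": {"brand_text": 78.1, "brand_images": 69.7,
              "content_auth": 87.5, "stress": 79.1},
    "FEST": {"brand_text": 82.9, "brand_images": 79.3,
             "content_auth": 91.0, "stress": 80.5},
}


def margins_over_best_baseline(table, method="FEST"):
    """Return {task: (best_baseline_name, margin_in_pp)} for `method`."""
    result = {}
    for task, score in table[method].items():
        # Collect every other method's score on this task.
        baselines = {name: row[task]
                     for name, row in table.items() if name != method}
        best = max(baselines, key=baselines.get)
        # Round to one decimal to match the table's precision.
        result[task] = (best, round(score - baselines[best], 1))
    return result


if __name__ == "__main__":
    for task, (name, gap) in margins_over_best_baseline(ACCURACY).items():
        print(f"{task}: +{gap} pp over {name}")
```

On the table's values, the strongest baseline is Felix for three tasks and Few-Shot LLM for image-based brand classification, with margins of 4.8, 4.6, 3.5, and 1.4 pp respectively.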
Using brand style guidelines as seed features, FEST operationalizes qualitative criteria into precise, measurable features and discovers complementary patterns. The chart below disentangles the contributions of refinement and augmentation across three brands, averaged over DT, LR, RF, and LLM classifiers.
@misc{khurana2026bridgingexpertknowledgeautomated,
title={Bridging Expert Knowledge and Automated Feature Engineering via Self-Evolution},
author={Varun Khurana and Vijval Ekbote and Vashu Chauhan and Yaman Kumar Singla and Rajiv Ratn Shah and Balaji Krishnamurthy},
year={2026},
eprint={2606.08800},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2606.08800},
}