Total: 1
Natural language understanding over tabular data has played a significant role in data discovery tasks such as joinable and unionable table search. State-of-the-art approaches adopt large language models (LLMs) pre-trained over massive text corpora to learn and evaluate the table semantic relatedness. Existing methods typically follow a pretrain-and-finetune paradigm, namely fine-tuning an LLM using tabular data with table relatedness labels. To enhance model’s understanding of tabular data, recent studies include auxiliary tasks such as entity resolution and column type classification in the fine-tuning phase. In spite of achieving performance gain from these supervisions, there is a lack of study on how these supervisions complement or even contrast each other, leading to a subpar performance on the final data discovery tasks. In this paper, we propose a simple yet effective multi-task fine-tuning framework named DiscoverGPT that holistically discovers and leverages the intricate relationships among the supervisions to optimize the performance on the data discovery task. Moreover, DiscoverGPT is plug-and-play that allows a broad range of open-domain auxiliary tasks to be incorporated, by utilizing the generative power of LLMs. We demonstrate the usability and effectiveness of DiscoverGPT with baseline comparisons and ablation studies. DiscoverGPT outperforms the best performing baseline by up to 7% in F1 score.