SeCo-600 Natural Language Query Dataset

This page links to the SeCo-600 dataset of natural language multi-focus queries used for the experiments reported in the LREC 2012 paper "Evaluating Multi-focus Natural Language Queries over Data Services" by Silvia Quarteroni, Vincenzo Guerrisi and Pietro La Torre.

Abstract

Natural language interfaces to data services will be a key technology to guarantee access to huge data repositories in an effortless way. This involves solving the complex problem of recognizing a relevant service or service composition given an ambiguous, potentially ungrammatical natural language question. As a first step toward this goal, we study methods for identifying the salient terms (or foci) in natural language questions, classifying the latter according to a taxonomy of services and extracting additional relevant information in order to route them to suitable data services. While current approaches deal with single-focus (and therefore single-domain) questions, we investigate multi-focus questions in the aim of supporting conjunctive queries over the data services they refer to. Since such complex queries have seldom been studied in the literature, we have collected an ad-hoc dataset, SeCo-600, containing 600 multi-domain queries annotated with a number of linguistic and pragmatic features. Our experiments with the dataset have allowed us to reach very high accuracy in different phases of query analysis, especially when adopting machine learning methods.

BibTeX

@InProceedings{QUARTERONI12.790,
  author = {Silvia Quarteroni and Vincenzo Guerrisi and Pietro La Torre},
  title = {Evaluating Multi-focus Natural Language Queries over Data Services},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
}

SeCo-600 dataset files

SeCo-600 is distributed in two formats: a textual format containing one natural language query per line, and a column format containing several annotation levels for each query (see the paper for details). The dataset is free to use and redistribute; we kindly ask that you cite the paper when doing so.
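
As an illustration of how the textual format might be read, here is a minimal Python sketch. It relies only on the one-query-per-line layout described above. The function names, the UTF-8 encoding and the skipping of blank lines are assumptions made for this example, not part of the distribution. The column format is not covered because its layout is specified only in the paper.

```python
from pathlib import Path
from typing import Iterable, List


def parse_queries(lines: Iterable[str]) -> List[str]:
    """Return one query per non-empty line, with surrounding whitespace removed."""
    return [line.strip() for line in lines if line.strip()]


def load_queries(path: str, encoding: str = "utf-8") -> List[str]:
    """Read a one-query-per-line text file (the encoding is an assumption)."""
    with Path(path).open(encoding=encoding) as handle:
        return parse_queries(handle)


if __name__ == "__main__":
    # Placeholder lines invented for illustration; they are not taken from SeCo-600.
    sample = ["first example query\n", "\n", "  second example query  \n"]
    for number, query in enumerate(parse_queries(sample), start=1):
        print(number, query)
```

To read an actual copy of the textual file, pass its local path to load_queries; the file name depends on where you saved the download.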