@google-research-datasets

Google Research Datasets

Datasets released by Google Research

Pinned

  1. natural-questions Public

    Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question ans…

    Python 897 151

  2. conceptual-captions Public

    Conceptual Captions is a dataset of (image-URL, caption) pairs designed for training and evaluating machine-learned image captioning systems.

    Shell 498 24

  3. Objectron Public

    Objectron is a dataset of short, object-centric video clips. The videos also contain AR session metadata, including camera poses, sparse point clouds, and planes. In each video, the came…

    Jupyter Notebook 2.2k 264

  4. wit Public

    WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

    968 39

  5. paws Public

    This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase ident…

    Python 541 52

  6. dstc8-schema-guided-dialogue Public

    The Schema-Guided Dialogue Dataset

    Python 528 124

Repositories

Showing 10 of 157 repositories
  • tpu_graphs
    C++ 120 Apache-2.0 43 2 1 Updated Jun 25, 2024
  • MISeD Public

    MISeD (Meeting Information Seeking Dialogs dataset) is an information-seeking dialog dataset focused on meeting transcripts. It includes 432 dialogs over transcripts from the QMSum dataset. MISeD is described in detail in the paper: Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts.

    3 1 0 0 Updated Jun 25, 2024
  • richhf-18k Public

    The RichHF-18K dataset contains the rich human feedback labels we collected for our CVPR'24 paper (https://arxiv.org/pdf/2312.10240), along with the file names of the associated labeled images (no URLs or images are included in this dataset).

    47 1 7 0 Updated Jun 25, 2024
  • web-images Public

    Images gathered from the Internet in 2023 and some metadata

    HTML 0 0 0 0 Updated Jun 24, 2024
  • GeniL Public

    The GeniL dataset is an effort to detect various types of generalization in language. This multilingual dataset covers sentences in EN, FR, ES, PT, AR, HI, BN, MS, and ID and is annotated by native speakers of each language. Each sentence is collected from public language corpora and contains at least one identity group name and an attribute.

    0 CC-BY-4.0 0 0 0 Updated Jun 18, 2024
  • adversarial-nibbler Public

    This dataset contains results from all rounds of Adversarial Nibbler. This data includes adversarial prompts fed into public generative text2image models and validations for unsafe images. There will be two sets of data: all prompts submitted and all prompts attempted (sent to t2i models but not submitted as unsafe).

    11 CC-BY-4.0 1 0 0 Updated Jun 12, 2024
  • ToTTo Public

    ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. We hope it can serve as a useful research benchmark for high-precision conditional text generation.

    424 34 6 0 Updated May 28, 2024
  • D3code Public

    D3code is a large-scale cross-cultural dataset of parallel annotations for offensive language detection by over 4k annotators, balanced across gender and age, from across 21 countries, representing eight geo-cultural regions.

    0 CC-BY-4.0 1 0 0 Updated May 22, 2024
  • visage Public

    Visage contains a dataset of images with human annotations on whether certain attributes are present or depicted in each image. An attribute may be either stereotypical or non-stereotypical with respect to the identity group in the image. It also contains a list of attributes in English along with annotations about whether they are visual.

    3 Apache-2.0 0 0 0 Updated May 17, 2024
  • sanpo_dataset
    Python 39 Apache-2.0 1 3 1 Updated May 16, 2024

