Export Datasets

client.download_dataset(
    my_dataset.id,
    new_dataset_version.id,
    split_id="",
    extract_to_dir="data",
    download_file="dataset.zip",
    remove_download_file=True,
    export_format=""
)

| Parameter | Type | Description |
|---|---|---|
| dataset_id | str | The dataset id |
| dataset_version_id | str | The dataset version id |
| split_id | str | (Optional) Specify the split_id if you only want to download that dataset split. |
| extract_to_dir | str | The directory to extract to. Default value: “data” |
| download_file | str | The name of the download file. Default value: “dataset.zip” |
| remove_download_file | bool | Flag indicating whether to remove or keep the downloaded zip file. Default value: True |
| export_format | DatasetFormat | The format of your dataset. Supported export formats: FOLDERS, YOLO, CSV, SPACY_NER |

The sections below describe the layout of the exported archive for each dataset type.

Image classification

export_format: FOLDERS

The .zip file contains a folder for every dataset split: “train”, “valid”, “test”.

Every dataset split folder contains a folder named after each label, e.g. “cats”, “dogs”.

Every label folder contains all the images of the dataset items for the given label in the given dataset split.

Dataset items are named by the “name” and “extension” property of the dataset item. If the “name” property is empty, the “id” is used to name the file.

Dataset items that have no label are placed directly in the split folder to which they belong.

.
+-- train/
|  +-- cats/
|     +-- cat1.jpg
|     +-- cat12.jpeg
|  +-- dogs/
|     +-- dog2.jpg
|     +-- dog4.png
|  +-- cat17.jpeg
|  +-- dog15.jpg
+-- valid/
|  +-- cats/
|     +-- cat4.jpg
|     +-- cat8.jpg
|  +-- dogs/
|     +-- dog9.jpg
|     +-- dog14.png
+-- test/
|  +-- cats/
|     +-- cat90.jpg
|     +-- cat34.jpeg
|  +-- dogs/
|     +-- dog81.jpg
|     +-- dog98.png
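
Once the archive is extracted, the layout above can be turned into a flat index of items. The following is a minimal sketch using only the Python standard library. The helper name index_folders_export is hypothetical and not part of the SDK, and the sketch assumes the archive was extracted to a local directory such as the default “data”.

```python
from pathlib import Path


def index_folders_export(root):
    """Return (split, label, path) tuples for an extracted FOLDERS export.

    label is None for items stored directly in a split folder (unlabeled).
    """
    records = []
    root = Path(root)
    for split_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for entry in sorted(split_dir.iterdir()):
            if entry.is_dir():
                # A label-named folder: every file inside carries that label.
                for item in sorted(f for f in entry.iterdir() if f.is_file()):
                    records.append((split_dir.name, entry.name, item))
            elif entry.is_file():
                # A file directly under the split folder has no label.
                records.append((split_dir.name, None, entry))
    return records
```

Calling index_folders_export("data") on the tree above would yield entries such as ("train", "cats", Path("data/train/cats/cat1.jpg")), with a label of None for cat17.jpeg and dog15.jpg.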

Text classification

export_format: FOLDERS

The .zip file contains a folder for every dataset split in your dataset, e.g. “train”, “test”, “unsup”.

Every dataset split folder contains a folder named after each label, e.g. “pos”, “neg”.

Every label folder contains all the text files of the dataset items for the given label in the given dataset split.

Dataset items are named by the “name” and “extension” property of the dataset item. If the “name” property is empty, the “id” is used to name the file.

Dataset items that have no label are placed directly in the split folder to which they belong.

.
+-- train/
|  +-- pos/
|     +-- 1.txt
|     +-- 3.txt
|  +-- neg/
|     +-- 2.txt
|     +-- 4.txt
+-- test/
|  +-- pos/
|     +-- 5.txt
|     +-- 7.txt
|  +-- neg/
|     +-- 6.txt
|     +-- 8.txt
+-- unsup/
|  +-- 13.txt
|  +-- 14.txt
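
To illustrate how this layout maps to training data, the sketch below reads one extracted split folder into parallel lists of texts and labels. It is only an example: load_text_split is a hypothetical helper name, and the .txt extension and UTF-8 encoding are assumptions based on the tree above rather than guarantees of the export.

```python
from pathlib import Path


def load_text_split(split_dir, encoding="utf-8"):
    """Read one extracted split folder into parallel lists of texts and labels."""
    split_dir = Path(split_dir)
    texts, labels = [], []
    for path in sorted(split_dir.rglob("*.txt")):
        # Files inside a label folder take the folder name as their label;
        # files directly in the split folder are unlabeled (None).
        label = path.parent.name if path.parent != split_dir else None
        texts.append(path.read_text(encoding=encoding))
        labels.append(label)
    return texts, labels
```

For the tree above, load_text_split("data/unsup") would return the two texts with a label of None for each, since the unsup split has no label folders.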

Object detection

export_format: YOLO

Object detection datasets are exported in YOLO v4 format.

For every dataset split, a dataset_split_name.txt file (e.g. train.txt) is created that lists the filenames of all items in that split.

Every dataset item has an image file and a .txt file associated with it. The .txt file contains one annotation per line in YOLO format: label_index relative_x relative_y relative_width relative_height.

The .names file contains the list of labels, where the index corresponds to the label_index in the annotation .txt files.

The config.json file contains a JSON object with the color for every label.

.
+-- train.txt
+-- valid.txt
+-- test.txt
+-- 1.jpg
+-- 1.txt
+-- 3.jpg
+-- 3.txt
+-- ...
+-- `dataset_version_id`.names
+-- config.json

A little more detail on the config.json file:

{
    "colors": {
        "label_1": "#36dfd4",
        "label_2": "#f0699e"
    }
}
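
If the label colors are needed for drawing, the hex strings can be converted to RGB tuples. This is a small sketch that assumes every color is a six-digit “#rrggbb” string, as in the example above; load_label_colors is a hypothetical helper name.

```python
import json


def load_label_colors(config_text):
    """Map each label to an (r, g, b) tuple parsed from its "#rrggbb" string."""
    colors = json.loads(config_text).get("colors", {})
    return {
        # Take the hex digits in pairs: red, green, blue.
        label: tuple(int(code.lstrip("#")[i:i + 2], 16) for i in (0, 2, 4))
        for label, code in colors.items()
    }
```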

Tabular

export_format: CSV

Tabular datasets are exported in a .zip file that contains a dataset_version_id.csv file accompanied by a config.json, which provides more details on how the data should be interpreted.

.
+-- dataset_version_id.csv
+-- config.json

A little more detail on the config.json file:

{
    "multi_label": false,
    "label_column": "labels",
    "split_column": "split",
    "label_separator": " ",
    "filename": "dataset_version_id.csv",
    "csv_separator": ","
}

| Parameter | Type | Description |
|---|---|---|
| multi_label | bool | Is the dataset multi label? |
| label_column | str | The column name that contains the labels |
| split_column | str | The column name that contains the name of the split the row belongs to |
| label_separator | str | If multi_label, use this separator to split the labels |
| filename | str | The name of the .csv file that contains the data |
| csv_separator | str | Use this separator to split each row into columns |
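
The config.json fields can drive a generic reader for the exported file. The sketch below uses Python's csv module and assumes the .csv file starts with a header row, which this page does not state explicitly. The function name read_tabular_export is hypothetical.

```python
import csv


def read_tabular_export(csv_file, config):
    """Group rows by split and parse the label column according to config."""
    # Assumption: the first row of the file holds the column names.
    reader = csv.DictReader(csv_file, delimiter=config["csv_separator"])
    rows_by_split = {}
    for row in reader:
        raw_label = row[config["label_column"]]
        if config["multi_label"]:
            # Multi-label rows hold several labels joined by label_separator.
            labels = [l for l in raw_label.split(config["label_separator"]) if l]
        else:
            labels = [raw_label]
        split = row[config["split_column"]]
        rows_by_split.setdefault(split, []).append((row, labels))
    return rows_by_split
```

In practice, config would be loaded with json.load from config.json, and csv_file would be the file named by config["filename"], opened with newline="" as the csv module recommends.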

Named entity recognition

export_format: SPACY_NER

For a named entity recognition dataset with splits:

  • train
  • valid
  • test

the zip file is structured in the following way:

.
+-- train.json
+-- valid.json
+-- test.json
+-- config.json

config.json

The config.json file contains a list of dataset splits, as well as a color code for every label.

{
    "splits": [
        "train",
        "valid",
        "test"
    ],
    "colors": {
        "label_name": "#82ebfd",
        "label_name2": "#e95211"
    }
}

split_name.json

For every dataset split, there is a split_name.json file (e.g. train.json) with the following structure:

[
    {
        "id": "the_dataset_item_id",
        "name": "the_original_filename",
        "text": "The textual content of the file that has been annotated.",
        "annotations": [{
            "start": 4,
            "end": 11,
            "label": "label_name"
        },
         {
             ...
         }
        ]
    },
    {
        ...
    }
]
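
The structure above can be flattened into (text, spans) pairs, a shape that many NER training pipelines accept. The sketch below assumes that start and end are character offsets with an exclusive end. This is consistent with the example, where offsets 4 to 11 select “textual”, but it is not stated explicitly on this page. The function name to_entity_tuples is hypothetical.

```python
def to_entity_tuples(items):
    """Turn split_name.json items into (text, [(start, end, label), ...]) pairs."""
    examples = []
    for item in items:
        text = item["text"]
        spans = []
        for annotation in item.get("annotations", []):
            start, end = annotation["start"], annotation["end"]
            # Assumption: offsets are character-based and end is exclusive.
            if not 0 <= start < end <= len(text):
                raise ValueError(
                    f"span {start}-{end} is outside the text of item {item.get('id')}"
                )
            spans.append((start, end, annotation["label"]))
        examples.append((text, sorted(spans)))
    return examples
```

Here items would be the list obtained by loading a file such as train.json with json.load.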