Export Datasets

client.download_dataset(
    my_dataset.id,
    new_dataset_version.id,
    split_id="",
    extract_to_folder="data",
    download_file="dataset.zip",
    remove_download_file=True,
    export_format=""
)

Parameter	Type	Description
dataset_id	str	The dataset id
dataset_version_id	str	The dataset version id
split_id	str	(Optional) Specify the split_id if you only want to download that dataset split.
extract_to_folder	str	The folder to extract to. Default value: “data”
download_file	str	The name of the download file. Default value: “dataset.zip”
remove_download_file	bool	Flag indicating whether to remove or keep the downloaded zip file. Default value: True
export_format	DatasetFormat	The format of your dataset. Supported export formats: FOLDERS, YOLO, CSV, SPACY_NER

The sections below describe each export format.

Image classification

export_format: FOLDERS

The .zip file contains a folder for every dataset split: “train”, “valid”, “test”.

Every dataset split folder contains one folder per label, named after that label: “cats”, “dogs”.

Every label folder contains all the images of the dataset items for the given label in the given dataset split.

Dataset items are named by the “name” and “extension” property of the dataset item. If the “name” property is empty, the “id” is used to name the file.

Dataset items that have no label are added directly to the split folder to which they belong.

.
+-- train/
|  +-- cats/
|     +-- cat1.jpg
|     +-- cat12.jpeg
|  +-- dogs/
|     +-- dog2.jpg
|     +-- dog4.png
|  +-- cat17.jpeg
|  +-- dog15.jpg
+-- valid/
|  +-- cats/
|     +-- cat4.jpg
|     +-- cat8.jpg
|  +-- dogs/
|     +-- dog9.jpg
|     +-- dog14.png
+-- test/
|  +-- cats/
|     +-- cat90.jpg
|     +-- cat34.jpeg
|  +-- dogs/
|     +-- dog81.jpg
|     +-- dog98.png
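
The folder layout above can be read back with the standard library alone. The following is a minimal sketch, not part of the SDK: the helper name list_labeled_files is hypothetical, and it assumes the archive has already been extracted (for example to the "data" folder).

```python
from pathlib import Path


def list_labeled_files(root):
    """Return (split, label, path) tuples for an extracted FOLDERS export.

    Files stored directly inside a split folder have no label, so their
    label is reported as None.
    """
    records = []
    for split_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for entry in sorted(split_dir.iterdir()):
            if entry.is_dir():
                # A label folder: every file inside carries that label.
                for item in sorted(entry.iterdir()):
                    if item.is_file():
                        records.append((split_dir.name, entry.name, item))
            elif entry.is_file():
                # An unlabeled item stored directly in the split folder.
                records.append((split_dir.name, None, entry))
    return records
```

Calling list_labeled_files("data") on the tree above would yield entries such as ("train", "cats", data/train/cats/cat1.jpg) and ("train", None, data/train/cat17.jpeg).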

Text classification

export_format: FOLDERS

The .zip file contains a folder for every dataset split in your dataset, e.g. “train”, “test”, “unsup”.

Every dataset split folder contains one folder per label, named after that label: “pos”, “neg”.

Every label folder contains all the text files of the dataset items for the given label in the given dataset split.

Dataset items are named by the “name” and “extension” property of the dataset item. If the “name” property is empty, the “id” is used to name the file.

Dataset items that have no label are added directly to the split folder to which they belong.

.
+-- train/
|  +-- pos/
|     +-- 1.txt
|     +-- 3.txt
|  +-- neg/
|     +-- 2.txt
|     +-- 4.txt
+-- test/
|  +-- pos/
|     +-- 5.txt
|     +-- 7.txt
|  +-- neg/
|     +-- 6.txt
|     +-- 8.txt
+-- unsup/
|  +-- 13.txt
|  +-- 14.txt

Object detection

export_format: YOLO

Object detection datasets are exported in YOLO v4 format.

For every dataset split, a dataset_split_name.txt file is created that lists all the filenames in that split.

Every dataset item has an image and a .txt file associated with it. The .txt file contains a list of annotations in YOLO format, one per line: label_index relative_x relative_y relative_width relative_height.

The .names file contains the list of labels, where the index corresponds to the label_index in the annotation .txt files.

The config.json file contains a JSON object with the color for every label.

.
+-- train.txt
+-- valid.txt
+-- test.txt
+-- 1.jpg
+-- 1.txt
+-- 3.jpg
+-- 3.txt
+-- ...
+-- `dataset_version_id`.names
+-- config.json

A little more detail on the config.json file:

{
    "colors": {
        "label_1": "#36dfd4",
        "label_2": "#f0699e"
    }
}
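
To use the YOLO annotations with an image library, the relative values usually need to be converted to pixel coordinates. The sketch below is illustrative only: the function names are hypothetical, and it assumes the usual Darknet convention in which relative_x and relative_y denote the center of the box. If your export uses the top-left corner instead, drop the half-width and half-height offsets.

```python
def yolo_line_to_pixel_box(line, image_width, image_height):
    """Convert one annotation line to (label_index, left, top, right, bottom)."""
    label_index, x, y, w, h = line.split()
    x, y, w, h = float(x), float(y), float(w), float(h)
    # Assumption: x and y are the box center (Darknet YOLO convention).
    left = (x - w / 2) * image_width
    top = (y - h / 2) * image_height
    right = (x + w / 2) * image_width
    bottom = (y + h / 2) * image_height
    return int(label_index), left, top, right, bottom


def parse_label_names(names_text):
    """Turn the content of the .names file into a list indexed by label_index."""
    return [line.strip() for line in names_text.splitlines() if line.strip()]
```

With both helpers, parse_label_names(...)[label_index] gives the label name for a box returned by yolo_line_to_pixel_box.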

Tabular

export_format: CSV

Tabular datasets are exported in a .zip file that contains a dataset_version_id.csv file accompanied by a config.json, which provides more details on how the data should be interpreted.

.
+-- dataset_version_id.csv
+-- config.json

A little more detail on the config.json file:

{
    "multi_label": false,
    "label_column": "labels",
    "split_column": "split",
    "label_separator": " ",
    "filename": "dataset_version_id.csv",
    "csv_separator": ","
}

Parameter	Type	Description
multi_label	bool	Is the dataset multi label?
label_column	str	The column name that contains the labels
split_column	str	The column name that contains the name of the split the row belongs to
label_separator	str	If `multi_label`, use this separator to split the labels
filename	str	The name of the .csv file that contains the data
csv_separator	str	Use this separator to split each row into columns
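
The config.json fields above are enough to read the CSV without hard-coding column names. The following is a minimal sketch using only the standard library. The helper name load_tabular_export is hypothetical, and it assumes the CSV file starts with a header row.

```python
import csv
import json
from pathlib import Path


def load_tabular_export(folder):
    """Read config.json and the CSV it points to; return rows grouped by split."""
    folder = Path(folder)
    config = json.loads((folder / "config.json").read_text(encoding="utf-8"))
    rows_by_split = {}
    csv_path = folder / config["filename"]
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Assumption: the first row of the CSV holds the column names.
        reader = csv.DictReader(handle, delimiter=config["csv_separator"])
        for row in reader:
            if config["multi_label"]:
                # Multi-label rows store several labels in a single cell.
                labels = row[config["label_column"]]
                row[config["label_column"]] = labels.split(config["label_separator"])
            split_name = row[config["split_column"]]
            rows_by_split.setdefault(split_name, []).append(row)
    return rows_by_split
```

The result is a dictionary keyed by split name, so load_tabular_export("data")["train"] would hold the training rows.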

Named entity recognition

export_format: SPACY_NER

For a named entity recognition dataset with splits:

train
valid
test

the zip file is structured in the following way:

.
+-- train.json
+-- valid.json
+-- test.json
+-- config.json

config.json

The config.json file contains a list of dataset splits, as well as a color code for every label.

{
    "splits": [
        "train",
        "valid",
        "test"
    ],
    "colors": {
        "label_name": "#82ebfd",
        "label_name2": "#e95211"
    }
}

split_name.json

For every dataset split, there is a ‘split_name’.json file with the following structure:

[
    {
        "id": "the_dataset_item_id",
        "name": "the_original_filename",
        "text": "The textual content of the file that has been annotated.",
        "annotations": [
            {
                "start": 4,
                "end": 11,
                "label": "label_name"
            },
            {
                ...
            }
        ]
    },
    {
        ...
    }
]
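
Each annotation can be mapped back to the text it covers by slicing the "text" field. The sketch below is an illustration rather than SDK code: the function name extract_entities is hypothetical, and it assumes "start" and "end" are character offsets with an exclusive end, as in spaCy. Verify this against your own export before relying on it.

```python
import json


def extract_entities(split_json_text):
    """Return (item_id, label, covered_text) for every annotation in a split file."""
    items = json.loads(split_json_text)
    entities = []
    for item in items:
        for annotation in item.get("annotations", []):
            # Assumption: offsets are character-based and "end" is exclusive.
            covered = item["text"][annotation["start"]:annotation["end"]]
            entities.append((item["id"], annotation["label"], covered))
    return entities
```

For a file on disk, pass its content, e.g. extract_entities(open("data/train.json", encoding="utf-8").read()).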
