Export Datasets
client.download_dataset(
    my_dataset.id,
    new_dataset_version.id,
    split_id="",
    extract_to_folder="data",
    download_file="dataset.zip",
    remove_download_file=True,
    export_format=""
)

| Parameter | Type | Description |
|---|---|---|
| dataset_id | str | The dataset id |
| dataset_version_id | str | The dataset version id |
| split_id | str | (Optional) Specify the split_id if you only want to download that dataset split. |
| extract_to_folder | str | The folder to extract to. Default value: “data” |
| download_file | str | The name of the download file. Default value: “dataset.zip” |
| remove_download_file | bool | Flag indicating whether to remove or keep the downloaded zip file. Default value: True |
| export_format | DatasetFormat | The format of your dataset. Supported export formats: FOLDERS, YOLO, CSV, SPACY_NER |
A note on the exported formats.
Image classification
export_format: FOLDERS
The .zip file contains a folder for every dataset split: “train”, “valid”, “test”.
Every dataset split folder contains a ’label-named’ folder for every label: “cats”, “dogs”.
Every label folder contains all the images of the dataset items for the given label in the given dataset split.
Dataset items are named by the “name” and “extension” property of the dataset item. If the “name” property is empty, the “id” is used to name the file.
Dataset items that have no label are added directly to the split folder to which they belong.
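The layout described above can be traversed with a few lines of standard-library Python. This is a minimal sketch, assuming the archive has already been extracted; `iter_folder_export` is a hypothetical helper name and not part of the SDK.

```python
from pathlib import Path
from typing import Iterator, Optional, Tuple


def iter_folder_export(root: str) -> Iterator[Tuple[str, Optional[str], Path]]:
    """Yield (split, label, file_path) for every file in an extracted FOLDERS export.

    Hypothetical helper: it assumes one folder per split, one sub-folder per
    label, and unlabeled items stored directly in the split folder.
    """
    for split_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for entry in sorted(split_dir.iterdir()):
            if entry.is_dir():
                # A label folder: every file inside carries that label.
                for item in sorted(f for f in entry.iterdir() if f.is_file()):
                    yield split_dir.name, entry.name, item
            elif entry.is_file():
                # A file directly under the split folder has no label.
                yield split_dir.name, None, entry
```

Calling `list(iter_folder_export("data"))` would give one tuple per dataset item, with `None` as the label for unlabeled items.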
.
+-- train/
|   +-- cats/
|       +-- cat1.jpg
|       +-- cat12.jpeg
|   +-- dogs/
|       +-- dog2.jpg
|       +-- dog4.png
|   +-- cat17.jpeg
|   +-- dog15.jpg
+-- valid/
|   +-- cats/
|       +-- cat4.jpg
|       +-- cat8.jpg
|   +-- dogs/
|       +-- dog9.jpg
|       +-- dog14.png
+-- test/
|   +-- cats/
|       +-- cat90.jpg
|       +-- cat34.jpeg
|   +-- dogs/
|       +-- dog81.jpg
|       +-- dog98.png

Text classification
export_format: FOLDERS
The .zip file contains a folder for every dataset split in your dataset, e.g. “train”, “test”, “unsup”.
Every dataset split folder contains a ’label-named’ folder for every label: “pos”, “neg”.
Every label folder contains all the text files of the dataset items for the given label in the given dataset split.
Dataset items are named by the “name” and “extension” property of the dataset item. If the “name” property is empty, the “id” is used to name the file.
Dataset items that have no label are added directly to the split folder to which they belong.
.
+-- train/
|   +-- pos/
|       +-- 1.txt
|       +-- 3.txt
|   +-- neg/
|       +-- 2.txt
|       +-- 4.txt
+-- test/
|   +-- pos/
|       +-- 5.txt
|       +-- 7.txt
|   +-- neg/
|       +-- 6.txt
|       +-- 8.txt
+-- unsup/
|   +-- 13.txt
|   +-- 14.txt

Object detection
export_format: YOLO
Object detection datasets are exported in YOLO v4 format.
For every dataset split a dataset_split_name.txt file gets created containing all the filenames for that dataset split.
Every dataset item has an image and a .txt file associated with it. The .txt file contains a list of annotations in YOLO format, one per line: label_index relative_x relative_y relative_width relative_height.
The .names file contains the list of labels, where the index corresponds to the label_index in the annotation .txt files.
The config.json file contains a JSON object with the color for every label.
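To make the annotation format concrete, here is a minimal sketch that turns the contents of one annotation .txt file into pixel values. `read_names`, `parse_yolo_annotations`, and `Box` are hypothetical names, not part of the SDK. The sketch assumes the .names file lists one label per line and that the image size is known; in the usual Darknet convention, relative_x and relative_y refer to the box center, but that is an assumption here.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    label: str
    x: float       # pixels (box center in the usual Darknet convention; assumption)
    y: float
    width: float
    height: float


def read_names(names_text: str) -> List[str]:
    """Assumes one label per line; the line position is the label_index."""
    return [line.strip() for line in names_text.splitlines() if line.strip()]


def parse_yolo_annotations(text: str, labels: List[str],
                           img_w: int, img_h: int) -> List[Box]:
    """Scale every 'label_index x y w h' line from relative to pixel units."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # skip blank or malformed lines
        rel_x, rel_y, rel_w, rel_h = map(float, parts[1:])
        boxes.append(Box(labels[int(parts[0])],
                         rel_x * img_w, rel_y * img_h,
                         rel_w * img_w, rel_h * img_h))
    return boxes
```

The image width and height are not stored in the annotation file, so they have to be read from the image itself, for example with an imaging library.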
.
+-- train.txt
+-- valid.txt
+-- test.txt
+-- 1.jpg
+-- 1.txt
+-- 3.jpg
+-- 3.txt
+-- ...
+-- `dataset_version_id`.names
+-- config.json

A little more detail on the config.json file:
{
  "colors": {
    "label_1": "#36dfd4",
    "label_2": "#f0699e"
  }
}

Tabular
export_format: CSV
Tabular datasets are exported in a .zip file that contains a dataset_version_id.csv file accompanied by a config.json, which provides more details on how the data should be interpreted.
.
+-- dataset_version_id.csv
+-- config.json

A little more detail on the config.json file:
{
  "multi_label": false,
  "label_column": "labels",
  "split_column": "split",
  "label_separator": " ",
  "filename": "dataset_version_id.csv",
  "csv_separator": ","
}

| Parameter | Type | Description |
|---|---|---|
| multi_label | bool | Is the dataset multi label? |
| label_column | str | The column name that contains the labels |
| split_column | str | The column name that contains the name of the split the row belongs to |
| label_separator | str | If multi_label, use this separator to split the labels |
| filename | str | The name of the .csv file that contains the data |
| csv_separator | str | Use this separator to split each row into columns |
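The config.json parameters above are enough to read the export without hard-coding column names. The following is a minimal sketch using only the standard library; `load_tabular_export` is a hypothetical helper, not part of the SDK, and the shape of the returned dictionaries is an illustrative choice.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List


def load_tabular_export(folder: str) -> List[Dict[str, object]]:
    """Read an extracted CSV export, guided by its config.json.

    Hypothetical helper: returns one dict per row with the split name,
    the list of labels, and the remaining columns as features.
    """
    root = Path(folder)
    config = json.loads((root / "config.json").read_text(encoding="utf-8"))
    rows = []
    with open(root / config["filename"], newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter=config["csv_separator"])
        for row in reader:
            raw = row.pop(config["label_column"])
            if config["multi_label"]:
                # Several labels in one cell, split on label_separator.
                labels = raw.split(config["label_separator"]) if raw else []
            else:
                labels = [raw] if raw else []
            rows.append({
                "split": row.pop(config["split_column"]),
                "labels": labels,
                "features": row,
            })
    return rows
```

Grouping the result by `"split"` then gives the train, validation, and test partitions.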
Named entity recognition
export_format: SPACY_NER
For a named entity recognition dataset with splits:
- train
- valid
- test
the .zip file is structured in the following way:
.
+-- train.json
+-- valid.json
+-- test.json
+-- config.json

config.json
The config.json file contains a list of dataset splits, as well as a color code for every label.
{
  "splits": [
    "train",
    "valid",
    "test"
  ],
  "colors": {
    "label_name": "#82ebfd",
    "label_name2": "#e95211"
  }
}

split_name.json
For every dataset split, there is a ‘split_name’.json file with the following structure:
[
  {
    "id": "the_dataset_item_id",
    "name": "the_original_filename",
    "text": "The textual content of the file that has been annotated.",
    "annotations": [
      {
        "start": 4,
        "end": 11,
        "label": "label_name"
      },
      {
        ...
      }
    ]
  },
  {
    ...
  }
]
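As an illustration of how the annotations relate to the text, the sketch below extracts the annotated span for every annotation in a split file. `extract_entities` is a hypothetical helper, not part of the SDK. It assumes that "start" and "end" are character offsets into "text" with "end" exclusive, which matches the example above (offsets 4 to 11 cover the word "textual").

```python
import json
from typing import List, Tuple


def extract_entities(split_json: str) -> List[Tuple[str, str, str]]:
    """Return (item_id, label, span_text) for every annotation in a split file.

    Assumes character offsets with an exclusive end, as in Python slicing.
    """
    entities = []
    for item in json.loads(split_json):
        text = item["text"]
        for annotation in item.get("annotations", []):
            span = text[annotation["start"]:annotation["end"]]
            entities.append((item["id"], annotation["label"], span))
    return entities
```

The function takes the file contents as a string, so a split can be processed with `extract_entities(open("train.json", encoding="utf-8").read())`.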