Export Datasets
client.download_dataset(
    my_dataset.id,
    new_dataset_version.id,
    split_id="",
    extract_to_dir="data",
    download_file="dataset.zip",
    remove_download_file=True,
    export_format=""
)
Parameter | Type | Description |
---|---|---|
dataset_id | str | The dataset id |
dataset_version_id | str | The dataset version id |
split_id | str | (Optional) Specify the split_id if you only want to download that dataset split. |
extract_to_dir | str | The directory to extract to. Default value: “data” |
download_file | str | The name of the download file. Default value: “dataset.zip” |
remove_download_file | bool | Flag indicating whether to remove or keep the downloaded zip file. Default value: True |
export_format | DatasetFormat | The format of your dataset. Supported export formats: FOLDERS, YOLO, CSV, SPACY_NER |
A note on the exported formats.
Image classification
export_format: FOLDERS
The .zip file contains a folder for every dataset split, e.g. “train”, “valid”, “test”.
Every dataset split folder contains a label-named folder for every label, e.g. “cats”, “dogs”.
Every label folder contains all the images of the dataset items for the given label in the given dataset split.
Dataset items are named by the “name” and “extension” properties of the dataset item. If the “name” property is empty, the “id” is used to name the file.
Dataset items that have no label are added to the split folder to which they belong.
.
+-- train/
|   +-- cats/
|   |   +-- cat1.jpg
|   |   +-- cat12.jpeg
|   +-- dogs/
|   |   +-- dog2.jpg
|   |   +-- dog4.png
|   +-- cat17.jpeg
|   +-- dog15.jpg
+-- valid/
|   +-- cats/
|   |   +-- cat4.jpg
|   |   +-- cat8.jpg
|   +-- dogs/
|   |   +-- dog9.jpg
|   |   +-- dog14.png
+-- test/
|   +-- cats/
|   |   +-- cat90.jpg
|   |   +-- cat34.jpeg
|   +-- dogs/
|   |   +-- dog81.jpg
|   |   +-- dog98.png
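The layout above can be indexed with a few lines of standard-library Python. This is a minimal sketch, assuming the archive has already been extracted to a local directory (for example "data"); the helper name `index_folders_export` is hypothetical and not part of the SDK.

```python
from pathlib import Path


def index_folders_export(root):
    """Return a list of (split, label, path) tuples for a FOLDERS export.

    label is None for files stored directly in a split folder,
    i.e. dataset items without a label.
    """
    rows = []
    for split_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for entry in sorted(split_dir.iterdir()):
            if entry.is_dir():
                # A label-named folder: every file inside carries that label.
                for item in sorted(f for f in entry.iterdir() if f.is_file()):
                    rows.append((split_dir.name, entry.name, item))
            elif entry.is_file():
                # A file directly in the split folder has no label.
                rows.append((split_dir.name, None, entry))
    return rows
```

Calling `index_folders_export("data")` on the example tree would yield entries such as `("train", "cats", Path("data/train/cats/cat1.jpg"))`, with `None` as the label for `cat17.jpeg` and `dog15.jpg`.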
Text classification
export_format: FOLDERS
The .zip file contains a folder for every dataset split in your dataset, e.g. “train”, “test”, “unsup”.
Every dataset split folder contains a label-named folder for every label, e.g. “pos”, “neg”.
Every label folder contains all the text files of the dataset items for the given label in the given dataset split.
Dataset items are named by the “name” and “extension” properties of the dataset item. If the “name” property is empty, the “id” is used to name the file.
Dataset items that have no label are added to the split folder to which they belong.
.
+-- train/
|   +-- pos/
|   |   +-- 1.txt
|   |   +-- 3.txt
|   +-- neg/
|   |   +-- 2.txt
|   |   +-- 4.txt
+-- test/
|   +-- pos/
|   |   +-- 5.txt
|   |   +-- 7.txt
|   +-- neg/
|   |   +-- 6.txt
|   |   +-- 8.txt
+-- unsup/
|   +-- 13.txt
|   +-- 14.txt
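For text datasets, one split of the extracted export can be read into (text, label) pairs. The sketch below is illustrative only: `load_text_split` is a hypothetical helper, and it assumes the files are UTF-8 encoded .txt files, which the export description does not state.

```python
from pathlib import Path


def load_text_split(root, split):
    """Return (text, label) pairs for one split of a text FOLDERS export.

    Files stored directly in the split folder (no label) get label None.
    Assumption: files are UTF-8 encoded and end in .txt.
    """
    split_dir = Path(root, split)
    pairs = []
    for path in sorted(split_dir.rglob("*.txt")):
        # The parent folder is the label, unless the file sits in the split folder itself.
        label = None if path.parent == split_dir else path.parent.name
        pairs.append((path.read_text(encoding="utf-8"), label))
    return pairs
```

With the example tree, `load_text_split("data", "unsup")` would return only pairs whose label is `None`.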
Object detection
export_format: YOLO
Object detection datasets are exported in YOLO v4 format.
For every dataset split, a dataset_split_name.txt file is created containing all the filenames for that dataset split.
Every dataset item has an image and a .txt file associated with it. The .txt file contains a list of annotations in YOLO format: label_index relative_x relative_y relative_width relative_height.
The .names file contains the list of labels, where the index corresponds to the label_index in the annotation .txt files.
The config.json file contains a JSON object with the color for every label.
.
+-- train.txt
+-- valid.txt
+-- test.txt
+-- 1.jpg
+-- 1.txt
+-- 3.jpg
+-- 3.txt
+-- ...
+-- `dataset_version_id`.names
+-- config.json
A little more detail on the config.json file:
{
  "colors": {
    "label_1": "#36dfd4",
    "label_2": "#f0699e"
  }
}
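To make the annotation format concrete, the sketch below converts one annotation line into a pixel-space box. It rests on two assumptions not stated above: that relative_x and relative_y denote the box center (the usual Darknet YOLO convention), and that the .names file lists one label per line. Both function names are hypothetical.

```python
def read_names(text):
    """Split the contents of a .names file into a list of labels.

    Assumption: one label per line, in label_index order.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def parse_yolo_line(line, labels, image_width, image_height):
    """Convert one annotation line to (label, left, top, width, height) in pixels.

    Assumption: relative_x / relative_y are the box center, and all four
    values are fractions of the image width or height.
    """
    index, x, y, w, h = line.split()
    box_w = float(w) * image_width
    box_h = float(h) * image_height
    left = float(x) * image_width - box_w / 2
    top = float(y) * image_height - box_h / 2
    return labels[int(index)], left, top, box_w, box_h
```

The image size is not part of the annotation file, so it has to be read from the image itself before calling `parse_yolo_line`.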
Tabular
export_format: CSV
Tabular datasets are exported in a .zip file that contains a dataset_version_id.csv file accompanied by a config.json, which provides more details on how the data should be interpreted.
.
+-- dataset_version_id.csv
+-- config.json
A little more detail on the config.json file:
{
  "multi_label": false,
  "label_column": "labels",
  "split_column": "split",
  "label_separator": " ",
  "filename": "dataset_version_id.csv",
  "csv_separator": ","
}
Parameter | Type | Description |
---|---|---|
multi_label | bool | Is the dataset multi label? |
label_column | str | The column name that contains the labels |
split_column | str | The column name that contains the name of the split the row belongs to |
label_separator | str | If multi_label is true, use this separator to split the labels |
filename | str | The name of the .csv file that contains the data |
csv_separator | str | Use this separator to split each row into columns |
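The config.json fields can drive a small loader. The following is a sketch only, using the standard library; `load_tabular_export` is a hypothetical helper, and it assumes the CSV file has a header row and is UTF-8 encoded, neither of which is stated above.

```python
import csv
import json
from pathlib import Path


def load_tabular_export(directory):
    """Read config.json and the CSV file it names; return a list of row dicts.

    When multi_label is true, the label column is split into a list of labels.
    Assumption: the CSV has a header row and is UTF-8 encoded.
    """
    directory = Path(directory)
    config = json.loads((directory / "config.json").read_text(encoding="utf-8"))
    with open(directory / config["filename"], newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f, delimiter=config["csv_separator"]))
    if config["multi_label"]:
        column = config["label_column"]
        separator = config["label_separator"]
        for row in rows:
            # Drop empty fragments so an empty cell becomes an empty list.
            row[column] = [label for label in row[column].split(separator) if label]
    return rows
```

Each returned row keeps the split name under the column named by `split_column`, so rows can be grouped per split afterwards.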
Named entity recognition
export_format: SPACY_NER
For a named entity recognition dataset with splits:
- train
- valid
- test
the .zip file is structured in the following way:
.
+-- train.json
+-- valid.json
+-- test.json
+-- config.json
config.json
The config.json file contains a list of dataset splits, as well as a color code for every label.
{
  "splits": [
    "train",
    "valid",
    "test"
  ],
  "colors": {
    "label_name": "#82ebfd",
    "label_name2": "#e95211"
  }
}
split_name.json
For every dataset split, there is a ‘split_name’.json file with the following structure:
[
  {
    "id": "the_dataset_item_id",
    "name": "the_original_filename",
    "text": "The textual content of the file that has been annotated.",
    "annotations": [
      {
        "start": 4,
        "end": 11,
        "label": "label_name"
      },
      {
        ...
      }
    ]
  },
  {
    ...
  }
]
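The structure above maps naturally onto (text, entities) pairs. The sketch below is a hedged illustration: `to_spacy_examples` is a hypothetical helper, and it assumes that start and end are character offsets with an exclusive end, which is what spaCy expects but is not stated in the description above.

```python
import json


def to_spacy_examples(split_json_text):
    """Turn the contents of a split_name.json file into (text, {"entities": [...]}) tuples.

    Assumption: "start" and "end" are character offsets into "text",
    with "end" exclusive.
    """
    examples = []
    for item in json.loads(split_json_text):
        entities = [(a["start"], a["end"], a["label"]) for a in item["annotations"]]
        examples.append((item["text"], {"entities": entities}))
    return examples
```

Under that assumption, `text[start:end]` gives the annotated span, which is a quick way to check the offsets on a few items before training.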