Import Datasets
Image classification
format: FOLDERS
For an image classification dataset with splits:
- train
- valid
- test
and labels:
- cats
- dogs
the zip file should be structured in the following way:
.
+-- train/
|   +-- cats/
|       +-- cat1.jpg
|       +-- cat12.jpeg
|   +-- dogs/
|       +-- dog2.jpg
|       +-- dog4.png
|   +-- cat17.jpeg
|   +-- dog15.jpg
+-- valid/
|   +-- cats/
|       +-- cat4.jpg
|       +-- cat8.jpg
|   +-- dogs/
|       +-- dog9.jpg
|       +-- dog14.png
+-- test/
|   +-- cats/
|       +-- cat90.jpg
|       +-- cat34.jpeg
|   +-- dogs/
|       +-- dog81.jpg
|       +-- dog98.png
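As a quick sanity check before uploading, the layout above can be inspected with Python's standard library. This is a minimal sketch, not part of any official tooling: the function name `summarize_folders_zip` is hypothetical, and it assumes that a file placed directly inside a split folder has no label (counted under `None`).

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_folders_zip(zip_bytes: bytes) -> Counter:
    """Count the files per (split, label) pair in a FOLDERS-style zip."""
    counts = Counter()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip explicit directory entries
            parts = PurePosixPath(name).parts
            if len(parts) == 3:    # split/label/file
                split, label, _ = parts
            elif len(parts) == 2:  # split/file (assumed: no label)
                split, label = parts[0], None
            else:
                continue           # ignore anything outside the expected layout
            counts[(split, label)] += 1
    return counts


# Build a tiny in-memory archive with empty placeholder files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for path in ["train/cats/cat1.jpg", "train/cats/cat12.jpeg",
                 "train/dogs/dog2.jpg", "valid/cats/cat4.jpg"]:
        archive.writestr(path, b"")

summary = summarize_folders_zip(buffer.getvalue())
for (split, label), count in sorted(summary.items()):
    print(split, label, count)
```

A split or label that is missing from the printed summary usually points to a misplaced folder in the archive.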
Text classification
format: FOLDERS
For a text classification dataset with splits:
- train
- test
- unsup
and labels:
- pos
- neg
the zip file should be structured in the following way:
.
+-- train/
|   +-- pos/
|       +-- 1.txt
|       +-- 3.txt
|   +-- neg/
|       +-- 2.txt
|       +-- 4.txt
+-- test/
|   +-- pos/
|       +-- 5.txt
|       +-- 7.txt
|   +-- neg/
|       +-- 6.txt
|       +-- 8.txt
+-- unsup/
|   +-- 13.txt
|   +-- 14.txt
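One way to produce such an archive from in-memory text is sketched below. The helper `build_text_zip` and the sample records are hypothetical. The sketch assumes that unlabelled items (as in the unsup split above) are written directly into the split folder, and that numbering the files sequentially is acceptable.

```python
import io
import zipfile


def build_text_zip(records):
    """Pack (split, label_or_None, text) records into a FOLDERS-style zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for index, (split, label, text) in enumerate(records, start=1):
            # Labelled items go in split/label/, unlabelled ones in split/.
            folder = f"{split}/{label}" if label else split
            archive.writestr(f"{folder}/{index}.txt", text)
    return buffer.getvalue()


records = [
    ("train", "pos", "Great film."),
    ("train", "neg", "Dull plot."),
    ("unsup", None, "A review without a label."),
]
zip_bytes = build_text_zip(records)

with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    names = archive.namelist()
print(names)
```

Writing `zip_bytes` to a file named, for example, dataset.zip gives an archive with the folder layout described above.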
Object detection
format: YOLO
Object detection datasets are imported in YOLO format.
For every dataset split, a dataset_split_name.txt file is created that lists all the filenames in that split.
Every dataset item has an image and a .txt file associated with it. The .txt file contains a list of annotations in YOLO format: label_index relative_x relative_y relative_width relative_height.
The .names file contains the list of labels, where the index corresponds to the label_index in the annotation .txt files.
The config.json file contains a JSON object with the color for every label.
.
+-- train.txt
+-- valid.txt
+-- test.txt
+-- 1.jpg
+-- 1.txt
+-- 3.jpg
+-- 3.txt
+-- ...
+-- dataset_version_id.names
+-- config.json
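The annotation line format can be illustrated with a small conversion from pixel coordinates. This is a sketch under an explicit assumption: relative_x and relative_y are taken to be the box center divided by the image width and height, which is the usual YOLO convention but is not stated in this document. The function names are hypothetical.

```python
def to_yolo_line(label_index, box, image_width, image_height):
    """Format a pixel box (x_min, y_min, x_max, y_max) as a YOLO annotation line."""
    x_min, y_min, x_max, y_max = box
    # Assumption: x and y describe the box center, relative to the image size.
    rel_x = (x_min + x_max) / 2 / image_width
    rel_y = (y_min + y_max) / 2 / image_height
    rel_w = (x_max - x_min) / image_width
    rel_h = (y_max - y_min) / image_height
    return f"{label_index} {rel_x:.6f} {rel_y:.6f} {rel_w:.6f} {rel_h:.6f}"


def parse_yolo_line(line):
    """Split an annotation line into (label_index, (x, y, width, height))."""
    index, *values = line.split()
    return int(index), tuple(float(value) for value in values)


# A 200x200 pixel box in a 400x500 image, labelled with index 0.
line = to_yolo_line(0, (100, 50, 300, 250), image_width=400, image_height=500)
print(line)  # 0 0.500000 0.300000 0.500000 0.400000
print(parse_yolo_line(line))
```

Each line of an item's .txt file would hold one such annotation, with label_index pointing into the .names file.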
A little more detail on the config.json file:
{
  "colors": {
    "label_name1": "#36dfd4",
    "label_name2": "#f0699e"
  }
}
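The supporting files of a YOLO dataset could be generated along the following lines. This is a hedged sketch: the function `write_yolo_metadata` is hypothetical, dataset_version_id.names is used literally as a placeholder filename, and the one-entry-per-line layout of the .names and split files is an assumption based on common YOLO tooling.

```python
import json
import tempfile
from pathlib import Path


def write_yolo_metadata(folder, labels, colors, splits,
                        names_filename="dataset_version_id.names"):
    """Write the .names file, config.json, and one <split>.txt per split."""
    folder = Path(folder)
    # Assumption: one label per line; the line position is the label_index.
    (folder / names_filename).write_text("\n".join(labels) + "\n", encoding="utf-8")
    # config.json maps every label name to a hex color.
    config = {"colors": {label: colors[label] for label in labels}}
    (folder / "config.json").write_text(json.dumps(config, indent=2), encoding="utf-8")
    # Assumption: each split file lists its image filenames, one per line.
    for split, filenames in splits.items():
        (folder / f"{split}.txt").write_text("\n".join(filenames) + "\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    write_yolo_metadata(
        tmp,
        labels=["label_name1", "label_name2"],
        colors={"label_name1": "#36dfd4", "label_name2": "#f0699e"},
        splits={"train": ["1.jpg"], "valid": ["3.jpg"]},
    )
    written = sorted(path.name for path in Path(tmp).iterdir())
    config_on_disk = json.loads((Path(tmp) / "config.json").read_text(encoding="utf-8"))
    names_on_disk = (Path(tmp) / "dataset_version_id.names").read_text(encoding="utf-8").splitlines()

print(written)
```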
Tabular
format: CSV
Tabular datasets are imported from .csv files accompanied by a config.json file that describes how the data should be interpreted.
.
+-- dataset.csv
+-- config.json
Config file
More detail on the config.json file:
{
  "multi_label": false,
  "label_column": "labels",
  "split_column": "split",
  "label_separator": " ",
  "filename": "dataset.csv",
  "csv_separator": ","
}
Parameter | Type | Description |
---|---|---|
multi_label | bool | Whether the dataset is multi-label |
label_column | str | The column name that contains the labels |
split_column | str | The column name that contains the name of the split the row belongs to |
label_separator | str | If multi_label is true, the separator used to split the labels |
filename | str | The name of the .csv file that contains the data |
csv_separator | str | Use this separator to split each row into columns |
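How these parameters interact can be sketched with the standard csv module. This illustrates one plausible reading of the table above rather than the importer's actual code: the function `group_labels_by_split`, the `text` column, and the sample rows are all hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_labels_by_split(csv_text, config):
    """Return {split_name: [labels_of_row, ...]} as described by config.json."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=config["csv_separator"])
    splits = defaultdict(list)
    for row in reader:
        raw = row[config["label_column"]]
        # label_separator is only applied when the dataset is multi-label.
        if config["multi_label"]:
            labels = raw.split(config["label_separator"])
        else:
            labels = [raw]
        splits[row[config["split_column"]]].append(labels)
    return dict(splits)


config = {
    "multi_label": False,
    "label_column": "labels",
    "split_column": "split",
    "label_separator": " ",
    "filename": "dataset.csv",
    "csv_separator": ",",
}
csv_text = "text,labels,split\nhello there,greeting informal,train\ngoodbye,farewell,valid\n"

single = group_labels_by_split(csv_text, config)
multi = group_labels_by_split(csv_text, {**config, "multi_label": True})
print(single["train"])  # [['greeting informal']]
print(multi["train"])   # [['greeting', 'informal']]
```

With multi_label set to false the whole cell is treated as one label, so the separator has no effect.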
Named entity recognition
format: SPACY_NER
For a named entity recognition dataset with splits:
- train
- valid
- test
the zip file should be structured in the following way:
.
+-- train.json
+-- valid.json
+-- test.json
+-- config.json
config.json
The config.json file contains a list of dataset splits, as well as a color code for every label.
{
  "splits": [
    "train",
    "valid",
    "test"
  ],
  "colors": {
    "label_name": "#82ebfd",
    "label_name2": "#e95211"
  }
}
split_name.json
For every dataset split, there is a split_name.json file with the following structure:
[
  {
    "id": "the_dataset_item_id",
    "name": "the_original_filename",
    "text": "The textual content of the file that has been annotated.",
    "annotations": [
      {
        "start": 4,
        "end": 11,
        "label": "label_name"
      },
      {
        ...
      }
    ]
  },
  {
    ...
  }
]
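A split file with this structure could be assembled as sketched below. The helper `make_ner_item` and the sample text are hypothetical. The sketch assumes that start and end are character offsets with an exclusive end, which is consistent with the example above, where characters 4 to 11 of the sample text spell "textual".

```python
import json


def make_ner_item(item_id, name, text, entities):
    """Build one dataset item from (start, end, label) character spans."""
    annotations = []
    for start, end, label in entities:
        # Assumption: offsets follow Python slicing, so text[start:end] is the entity.
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span ({start}, {end}) is outside the text")
        annotations.append({"start": start, "end": end, "label": label})
    return {"id": item_id, "name": name, "text": text, "annotations": annotations}


text = "The Eiffel Tower is in Paris."
item = make_ner_item("item-1", "note.txt", text,
                     [(4, 16, "landmark"), (23, 28, "city")])

# The content of a split file such as train.json is a JSON list of items.
split_json = json.dumps([item], indent=2)
print(split_json)

for annotation in item["annotations"]:
    print(annotation["label"], "->", text[annotation["start"]:annotation["end"]])
```

Because `json.dumps` produces the output, the file cannot contain trailing commas, which strict JSON parsers reject.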
Parameter | Type | Description |
---|---|---|
multi_label | bool | Whether the dataset is multi-label |
dataset_id | str | The dataset id |
dataset_version_id | str | The dataset version id |
folder | str | The folder that contains the .zip file. Default value: "data" |
filename | str | The name of the upload file. Default value: "dataset.zip" |
format | DatasetFormat | The format of your dataset. Supported import formats: FOLDERS, YOLO, CSV, SPACY_NER |