Import Datasets

Image classification

format: FOLDERS

For an image classification dataset with splits:

  • train
  • valid
  • test

and labels:

  • cats
  • dogs

the zip file should be structured in the following way:

.
+-- train/
|  +-- cats/
|     +-- cat1.jpg
|     +-- cat12.jpeg
|  +-- dogs/
|     +-- dog2.jpg
|     +-- dog4.png
|  +-- cat17.jpeg
|  +-- dog15.jpg
+-- valid/
|  +-- cats/
|     +-- cat4.jpg
|     +-- cat8.jpg
|  +-- dogs/
|     +-- dog9.jpg
|     +-- dog14.png
+-- test/
|  +-- cats/
|     +-- cat90.jpg
|     +-- cat34.jpeg
|  +-- dogs/
|     +-- dog81.jpg
|     +-- dog98.png
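
As an illustration, the layout above could be packed into a zip file with a few lines of Python. This is a minimal sketch, not the platform's own tooling: the function name zip_folder_dataset and the example paths are hypothetical, and it assumes the split folders must sit at the top level of the archive, as the tree suggests.

```python
import zipfile
from pathlib import Path


def zip_folder_dataset(root, zip_path):
    """Zip `root` so that its split folders (train/, valid/, ...) sit at the archive top level.

    Hypothetical helper for illustration; returns the archive member names.
    """
    root = Path(root)
    zip_path = Path(zip_path)
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            # Skip directories, and skip the archive itself if it lives inside root.
            if not path.is_file() or path.resolve() == zip_path.resolve():
                continue
            # Store paths relative to root, e.g. "train/cats/cat1.jpg".
            arcname = path.relative_to(root).as_posix()
            archive.write(path, arcname)
            names.append(arcname)
    return names
```

For example, zip_folder_dataset("my_dataset", "data/dataset.zip") would produce an archive whose first-level entries are the split folders, provided the data folder already exists.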

Text classification

format: FOLDERS

For a text classification dataset with splits:

  • train
  • test
  • unsup

and labels:

  • pos
  • neg

the zip file should be structured in the following way:

.
+-- train/
|  +-- pos/
|     +-- 1.txt
|     +-- 3.txt
|  +-- neg/
|     +-- 2.txt
|     +-- 4.txt
+-- test/
|  +-- pos/
|     +-- 5.txt
|     +-- 7.txt
|  +-- neg/
|     +-- 6.txt
|     +-- 8.txt
+-- unsup/
|     +-- 13.txt
|     +-- 14.txt
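
To check a folder before zipping it, the structure above can be read back into (split, label, filename) entries. The sketch below is only an illustration: index_folder_dataset is a hypothetical helper, and treating files that sit directly inside a split folder (as in unsup/) as unlabeled is an assumption drawn from the tree, not a documented rule.

```python
from pathlib import Path


def index_folder_dataset(root):
    """List (split, label, filename) for every file in a FOLDERS-style dataset.

    Assumption: split/label/file means a labeled item, split/file an unlabeled one.
    """
    root = Path(root)
    items = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        parts = path.relative_to(root).parts
        if len(parts) == 3:
            split, label, name = parts
        elif len(parts) == 2:
            split, name = parts
            label = None  # no label folder, so the item is treated as unlabeled
        else:
            continue  # anything shallower or deeper does not match the layout
        items.append((split, label, name))
    return items
```

Printing the result gives a quick overview of which files would end up in which split and under which label.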

Object detection

format: YOLO

Object detection datasets are imported in YOLO format.

Each dataset split has a dataset_split_name.txt file (for example, train.txt) that lists all the filenames in that split.

Every dataset item consists of an image and a .txt file associated with it (for example, 1.jpg and 1.txt). The .txt file contains the item's annotations in YOLO format: label_index relative_x relative_y relative_width relative_height.
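
The annotation format can be sketched in Python as follows. This assumes the usual YOLO convention, in which relative_x and relative_y are the center of the box and all four values are divided by the image width or height; the text above does not state this explicitly, and the function names are hypothetical.

```python
def to_yolo_line(label_index, box, image_size):
    """Convert a pixel box (x_min, y_min, x_max, y_max) into a YOLO annotation line.

    Assumption: relative_x / relative_y are the box center, as in standard YOLO.
    """
    x_min, y_min, x_max, y_max = box
    img_w, img_h = image_size
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{label_index} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


def parse_yolo_line(line):
    """Split an annotation line back into (label_index, x, y, width, height)."""
    label_index, x, y, w, h = line.split()
    return int(label_index), float(x), float(y), float(w), float(h)


# A 200x200 pixel box in a 400x500 image, labeled with index 0.
print(to_yolo_line(0, (100, 50, 300, 250), (400, 500)))
# → 0 0.500000 0.300000 0.500000 0.400000
```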

The .names file contains the list of labels, where the index corresponds to the label_index in the annotation .txt files.

The config.json file contains a JSON object with the color for every label.

.
+-- train.txt
+-- valid.txt
+-- test.txt
+-- 1.jpg
+-- 1.txt
+-- 3.jpg
+-- 3.txt
+-- ...
+-- dataset_version_id.names
+-- config.json

A little more detail on the config.json file:

{
    "colors": {
        "label_name1": "#36dfd4",
        "label_name2": "#f0699e"
    }
}
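
The metadata files that accompany the images could be generated along these lines. This is a sketch under assumptions: write_yolo_metadata is a hypothetical helper, and writing one label or one filename per line follows common YOLO tooling rather than anything stated above.

```python
import json
from pathlib import Path


def write_yolo_metadata(out_dir, names_filename, label_colors, splits):
    """Write the .names file, config.json and one <split>.txt file per split.

    label_colors: dict of label name -> hex color; its order defines label_index.
    splits: dict of split name -> list of filenames in that split.
    Assumption: one entry per line in the .names and split files.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    labels = list(label_colors)
    # The position of a label in this file is its label_index.
    (out_dir / names_filename).write_text("\n".join(labels) + "\n")
    (out_dir / "config.json").write_text(json.dumps({"colors": label_colors}, indent=4))
    for split_name, filenames in splits.items():
        (out_dir / f"{split_name}.txt").write_text("\n".join(filenames) + "\n")
    return labels
```

Because label_index refers to a position in the .names file, the order of the labels has to stay the same between the annotations and that file.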

Tabular

format: CSV

Tabular datasets are imported from .csv files accompanied by a config.json file that provides more details on how the data should be interpreted.

.
+-- dataset.csv
+-- config.json

Config file

A little more detail about the config.json file:

{
    "multi_label": false,
    "label_column": "labels",
    "split_column": "split",
    "label_separator": " ",
    "filename": "dataset.csv",
    "csv_separator": ","
}

Parameter        Type  Description
multi_label      bool  Is the dataset multi-label?
label_column     str   The name of the column that contains the labels
split_column     str   The name of the column that contains the split the row belongs to
label_separator  str   If multi_label is true, the separator used to split the labels
filename         str   The name of the .csv file that contains the data
csv_separator    str   The separator used to split each row into columns
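
To make the role of each parameter concrete, here is one way the config could be applied to the CSV content. It is a reading of the parameter descriptions, not the platform's actual importer, and read_tabular_rows is a hypothetical name.

```python
import csv
import io


def read_tabular_rows(csv_text, config):
    """Split CSV text into rows with features, labels and split, guided by the config.

    Illustrative only: shows how the config.json parameters relate to the CSV columns.
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=config["csv_separator"])
    rows = []
    for row in reader:
        raw_label = row.pop(config["label_column"])
        split = row.pop(config["split_column"])
        if config["multi_label"]:
            # Several labels share one cell, separated by label_separator.
            labels = raw_label.split(config["label_separator"])
        else:
            labels = [raw_label]
        # Whatever columns remain are the features of the row.
        rows.append({"features": row, "labels": labels, "split": split})
    return rows
```

With "multi_label": true and "label_separator": " ", a cell such as "a b" would yield the two labels a and b; with "multi_label": false the same cell is kept as the single label "a b".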

Named entity recognition

format: SPACY_NER

For a named entity recognition dataset with splits:

  • train
  • valid
  • test

the zip file should be structured in the following way:

.
+-- train.json
+-- valid.json
+-- test.json
+-- config.json

config.json

The config.json file contains a list of dataset splits, as well as a color code for every label.

{
    "splits": [
        "train",
        "valid",
        "test"
    ],
    "colors": {
        "label_name": "#82ebfd",
        "label_name2": "#e95211"
    }
}

split_name.json

For every dataset split, there is a split_name.json file (for example, train.json) with the following structure:

[{
    "id": "the_dataset_item_id",
    "name": "the_original_filename",
    "text": "The textual content of the file that has been annotated.",
    "annotations": [{
        "start": 4,
        "end": 11,
        "label": "label_name"
    },
    {
        ...
    }
    ]
},
{
    ...
}
]
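
The start and end values are character offsets into text. In the example above, start 4 and end 11 select the word "textual", which suggests that end is exclusive, as in Python slicing. A small check along these lines can catch offset mistakes before uploading; annotation_spans is a hypothetical helper, and the exclusive end is an inference from the example rather than a stated rule.

```python
def annotation_spans(item):
    """Return (annotated_text, label) pairs for one dataset item.

    Assumption: offsets are character-based and `end` is exclusive.
    Raises ValueError when an annotation falls outside the text.
    """
    text = item["text"]
    spans = []
    for annotation in item["annotations"]:
        start, end = annotation["start"], annotation["end"]
        if not 0 <= start < end <= len(text):
            raise ValueError(f"bad offsets {start}-{end} in item {item['id']}")
        spans.append((text[start:end], annotation["label"]))
    return spans


item = {
    "id": "the_dataset_item_id",
    "name": "the_original_filename",
    "text": "The textual content of the file that has been annotated.",
    "annotations": [{"start": 4, "end": 11, "label": "label_name"}],
}
print(annotation_spans(item))
# → [('textual', 'label_name')]
```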

Parameter           Type           Description
multi_label         bool           Is the dataset multi-label?
dataset_id          str            The dataset id
dataset_version_id  str            The dataset version id
folder              str            The folder that contains the .zip file. Default value: “data”
filename            str            The name of the upload file. Default value: “dataset.zip”
format              DatasetFormat  The format of your dataset. Supported import formats: FOLDERS, YOLO, CSV, NER