Create your dataset


This low-code guide lets you create a Named Entity Recognition dataset, set up the labels, and upload your documents. To do this, you will need to complete the following steps:

  • Prerequisites
  • Set up the client
  • Create the dataset
  • Upload your documents


Prerequisites

This guide expects you to have the Python SDK and PyMuPDF installed:

pip install seeme
pip install pymupdf # to convert PDF to text
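
Note that the package installed as pymupdf is imported as fitz in this guide. As a quick sanity check before continuing, the sketch below reports which modules cannot be found in the current environment. The helper name missing_modules is hypothetical and not part of the SDK.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# "seeme" is the SDK; PyMuPDF is imported as "fitz" in this guide.
missing = missing_modules(["seeme", "fitz"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All prerequisites are importable.")
```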

Set up the Client

You need the client to interact with the backend:

from seeme import Client

cl = Client()

username = "" # Add your username
password = "" # Add your password

cl.login(username, password)

For more details or login options, have a look at the Client docs.
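
If you prefer not to keep credentials in the script, one option is to read them from environment variables. This is a minimal sketch: the variable names SEEME_USERNAME and SEEME_PASSWORD and the helper read_credentials are assumptions made for illustration, not something the SDK defines.

```python
import os


def read_credentials(env=None):
    """Read (username, password) from environment variables.

    SEEME_USERNAME and SEEME_PASSWORD are hypothetical names chosen
    for this sketch; use whatever names fit your setup.
    """
    env = os.environ if env is None else env
    username = env.get("SEEME_USERNAME", "")
    password = env.get("SEEME_PASSWORD", "")
    if not username or not password:
        raise RuntimeError("Set SEEME_USERNAME and SEEME_PASSWORD before running.")
    return username, password


# Usage with the client created above:
#   username, password = read_credentials()
#   cl.login(username, password)
```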

Create Dataset

Create a named entity recognition dataset:

from seeme.types import Dataset, DatasetContentType

dataset = Dataset(
  name="Entity recognition",
  content_type=DatasetContentType.NER, # assumed enum member for NER datasets; check the Dataset docs
  multi_label=True, # so you can add multiple annotations to each document
  default_splits=True, # already create train, valid, test dataset splits
)

dataset = cl.create_dataset(dataset)

ds_version = dataset.versions[0]

splits = ds_version.splits

For more details about datasets, have a look at the Dataset docs.
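
The splits list above holds the default train, valid, and test splits. A small lookup helper can make selecting one of them by name more explicit and gives a readable error when the name does not exist. This sketch assumes each split object exposes a name attribute. The helper find_split is hypothetical, and the demo uses stand-in objects rather than real SDK splits.

```python
from types import SimpleNamespace


def find_split(splits, name):
    """Return the first split whose `name` attribute equals `name`."""
    for split in splits:
        if getattr(split, "name", None) == name:
            return split
    available = [getattr(split, "name", None) for split in splits]
    raise ValueError(f"No split named {name!r}; available: {available}")


# Stand-in objects for illustration only; in the guide you would pass
# ds_version.splits instead.
demo_splits = [SimpleNamespace(name=n) for n in ("train", "valid", "test")]
demo_train = find_split(demo_splits, "train")
print(demo_train.name)
```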

Upload Documents

Once the dataset and its labels are created, you can start uploading your documents. You can add more labels later on.

The NER dataset expects each dataset item to contain text, so PDF documents need to be converted to text first. Here we use PyMuPDF (imported as fitz) to extract all text from each PDF before creating the dataset item:

import os
import glob
import fitz # PyMuPDF

from seeme.types import DatasetItem

# define the file location

pdf_folder = "pdf_files"

# assumes each split exposes a `name` attribute
train_split = [split for split in splits if split.name == "train"][0]

for filename in glob.glob(f"{pdf_folder}/*.pdf"):
    text = ""
    doc = fitz.open(filename)
    for page in doc:
        text += page.get_text()
    ds_item = DatasetItem(
      name = filename,
      text = text,
      splits = [train_split],
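      extension = "txt"
    )

    # assumed arguments: the dataset ID and dataset version ID, then the item
    cl.create_dataset_item(dataset.id, ds_version.id, ds_item)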
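
The loop above passes the full path as the item name and depends on the order glob returns. As a possible refinement, the sketch below lists the PDFs in a folder in a stable order, matches the extension case-insensitively, and derives a shorter item name from the file name. The helper list_pdfs is hypothetical, and the demo runs on empty placeholder files in a temporary folder, not real PDFs.

```python
import tempfile
from pathlib import Path


def list_pdfs(folder):
    """Return (path, item_name) pairs for the PDF files in `folder`.

    Files are matched on a case-insensitive ".pdf" suffix and sorted by
    lowercased file name so repeated runs give the same order.
    """
    pdfs = [
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    ]
    pdfs.sort(key=lambda p: p.name.lower())
    return [(str(p), p.stem) for p in pdfs]


# Demo with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for fname in ("b.pdf", "A.PDF", "notes.txt"):
        (Path(tmp) / fname).touch()
    demo = list_pdfs(tmp)

print([name for _, name in demo])
```

In the upload loop, each path could then be passed to fitz.open and each item_name used as the name of the dataset item.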