Create your dataset

Intro

This low-code guide shows you how to create a Named Entity Recognition (NER) dataset, set up the labels, and upload your documents. It walks through the following steps:

  • Prerequisites
  • Set up the client
  • Create the dataset
  • Upload your documents

Prerequisites

This guide assumes you have the SeeMe.ai Python SDK and PyMuPDF installed:

pip install seeme
pip install pymupdf # to convert PDF to text
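
Before continuing, it can help to confirm that both packages are importable. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper written for this guide, not part of the SDK. Note that PyMuPDF is imported under the module name `fitz`.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# "seeme" is the SDK package; "fitz" is the import name PyMuPDF provides.
missing = missing_modules(["seeme", "fitz"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All prerequisites are installed.")
```

If a module is reported as missing, rerun the matching `pip install` command in the same Python environment you use to run this guide.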

Set up the Client

You need the SeeMe.ai client to interact with the backend:

from seeme import Client

cl = Client()

username = "" # Add your username
password = "" # Add your password

cl.login(username, password)

For more details or login options, have a look at the Client docs.
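
If you prefer not to hard-code credentials in a script, one option is to read them from environment variables. This is a sketch under assumptions: the variable names `SEEME_USERNAME` and `SEEME_PASSWORD` and the helper `load_credentials` are hypothetical and not defined by the SDK.

```python
import os


def load_credentials(env=os.environ):
    """Read SeeMe.ai credentials from a mapping of environment variables.

    SEEME_USERNAME and SEEME_PASSWORD are hypothetical names chosen for
    this sketch; the SDK itself does not define them.
    """
    username = env.get("SEEME_USERNAME", "")
    password = env.get("SEEME_PASSWORD", "")
    if not username or not password:
        raise ValueError("Set SEEME_USERNAME and SEEME_PASSWORD first")
    return username, password


# Usage with the client created above:
# username, password = load_credentials()
# cl.login(username, password)
```

Failing early with a clear message avoids sending an empty username or password to the backend.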

Create Dataset

Create a named entity recognition dataset:

from seeme.types import Dataset, DatasetContentType

dataset = Dataset(
  name="Entity recognition",
  multi_label=True, # so you can add multiple annotations to each document
  default_splits=True, # already create train, valid, test dataset splits
  content_type=DatasetContentType.NER # matches the DatasetContentType import above
)

dataset = cl.create_dataset(dataset)

ds_version = dataset.versions[0]

splits = ds_version.splits

For more details about datasets, have a look at the Dataset docs.
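
The splits are a list of objects with a `name` attribute, so you will usually want to pick one by name. A small lookup helper, sketched below, raises a descriptive error when the name does not exist instead of a bare `IndexError`. `find_split` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for illustration only; real splits come from `ds_version.splits`.

```python
from types import SimpleNamespace


def find_split(splits, name):
    """Return the first split whose name matches, or raise a clear error."""
    for split in splits:
        if split.name == name:
            return split
    available = ", ".join(s.name for s in splits)
    raise LookupError(f"No split named {name!r}; available: {available}")


# Stand-in objects for illustration; replace with ds_version.splits.
demo_splits = [SimpleNamespace(name=n) for n in ("train", "valid", "test")]
print(find_split(demo_splits, "valid").name)
```

With the real dataset, `find_split(splits, "train")` would return the training split.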

Upload Documents

Once the dataset and labels are created, you can start uploading your documents. You can still add more labels later on.

An NER dataset expects each dataset item to contain text. PDF documents therefore need to be converted to text first. Here we use PyMuPDF to extract all text from each PDF before creating the dataset item:

import glob
import fitz  # PyMuPDF

from seeme.types import DatasetItem

# Folder containing the PDF files to upload
pdf_folder = "pdf_files"

# Add every document to the "train" split
train_split = [split for split in splits if split.name == "train"][0]

for filename in glob.glob(f"{pdf_folder}/*.pdf"):
    # Concatenate the text of all pages in the PDF
    text = ""
    doc = fitz.open(filename)
    for page in doc:
        text += page.get_text()
    doc.close()

    ds_item = DatasetItem(
        name=filename,
        text=text,
        splits=[train_split],
        extension="txt"
    )

    cl.create_dataset_item(dataset.id, ds_version.id, ds_item)
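
Scanned PDFs without a text layer yield empty strings from `page.get_text()`, which would create empty dataset items. The sketch below shows one way to prepare the item fields and skip such files. `build_item_fields` is a hypothetical helper, and using the base file name as the item name and joining pages with a newline are design choices of this sketch, not SDK requirements.

```python
import os


def build_item_fields(pdf_path, page_texts):
    """Combine per-page text into the fields used for a dataset item.

    Returns None when the pages contain no text (for example a scanned
    PDF without a text layer), so the caller can skip the file.
    """
    text = "\n".join(t.strip() for t in page_texts if t.strip())
    if not text:
        return None
    return {
        "name": os.path.basename(pdf_path),
        "text": text,
        "extension": "txt",
    }


# page_texts would normally be [page.get_text() for page in doc]
fields = build_item_fields(
    "pdf_files/report.pdf", ["First page\n", "", "Second page"]
)
print(fields)
```

In the upload loop, you could skip the file when the helper returns `None` and otherwise pass the fields, together with the target split, to `DatasetItem`.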