Create your dataset
Intro
This low-code guide walks you through creating a Named Entity Recognition (NER) dataset, setting up its labels, and uploading your documents. It covers the following steps:
- Prerequisites
- Set up the client
- Create the dataset
- Upload your documents
Prerequisites
This guide expects you to have the SeeMe.ai Python SDK and PyMuPDF (the Python bindings for MuPDF) installed:
pip install seeme
pip install pymupdf # to convert PDF to text
Set up the Client
You need the SeeMe.ai client to interact with the backend:
from seeme import Client
cl = Client()
username = "" # Add your username
password = "" # Add your password
cl.login(username, password)
For more details or login options, have a look at the Client docs.
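If you prefer not to hard-code credentials in a script, one option is to read them from environment variables and fall back to an interactive prompt. This is a minimal sketch using only the standard library; the variable names `SEEME_USERNAME` and `SEEME_PASSWORD` and the helper `read_credentials` are assumptions made for illustration, not something the SDK defines or reads on its own.

```python
import os
from getpass import getpass


def read_credentials(env=os.environ):
    """Return (username, password) from the environment, prompting as a fallback."""
    # SEEME_USERNAME and SEEME_PASSWORD are hypothetical variable names,
    # chosen for this example only.
    username = env.get("SEEME_USERNAME") or input("Username: ")
    password = env.get("SEEME_PASSWORD") or getpass("Password: ")
    return username, password
```

You could then call `username, password = read_credentials()` and pass both values to `cl.login(username, password)` as shown above.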
Create Dataset
Create a named entity recognition dataset:
from seeme.types import Dataset, DatasetContentType
dataset = Dataset(
    name="Entity recognition",
    multi_label=True, # so you can add multiple annotations to each document
    default_splits=True, # already create train, valid, test dataset splits
    content_type=DatasetContentType.NER
)
dataset = cl.create_dataset(dataset)
ds_version = dataset.versions[0]
splits = ds_version.splits
For more details about datasets, have a look at the Dataset docs.
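Because later steps need to pick one of the default splits by name, a small lookup helper can make a missing split easier to diagnose than a bare index error. The sketch below assumes only that each split object has a `name` attribute, as used in this guide; `find_split` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for the real entries in `ds_version.splits`.

```python
from types import SimpleNamespace


def find_split(splits, name):
    """Return the first split whose name matches, or raise a descriptive error."""
    for split in splits:
        if split.name == name:
            return split
    available = ", ".join(s.name for s in splits)
    raise ValueError(f"No split named {name!r}; available: {available}")


# Stand-in objects for illustration only; real splits come from ds_version.splits.
example_splits = [SimpleNamespace(name=n) for n in ("train", "valid", "test")]
example_train = find_split(example_splits, "train")
```

With a real dataset, the call would be `find_split(splits, "train")`.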
Upload Documents
Once the dataset and its labels are created (you can add more labels later), you can start uploading your documents.
An NER dataset expects each dataset item to contain text, so PDF documents need to be converted to text first. Here we use PyMuPDF to extract all text from each PDF before creating the dataset item:
import os
import glob
import fitz # PyMuPDF
from seeme.types import DatasetItem

# define the file location
pdf_folder = "pdf_files"
# pick the "train" split that default_splits=True created
train_split = [split for split in splits if split.name == "train"][0]
for filename in glob.glob(f"{pdf_folder}/*.pdf"):
    # concatenate the text of every page in the PDF
    text = ""
    doc = fitz.open(filename)
    for page in doc:
        text += page.get_text()
    # create one dataset item per PDF and add it to the train split
    ds_item = DatasetItem(
        name=filename,
        text=text,
        splits=[train_split],
        extension="txt"
    )
    cl.create_dataset_item(dataset.id, ds_version.id, ds_item)
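Image-only PDFs (for example scans without a text layer) yield empty text from `page.get_text()`, which would create dataset items with nothing to annotate. The sketch below shows one possible way to prepare the item fields and skip such documents. `build_item_fields` is a hypothetical helper, not part of the SDK, and it takes the per-page text as plain strings so that it can be tried without PyMuPDF or a backend connection.

```python
from pathlib import Path


def build_item_fields(pdf_path, page_texts):
    """Return name/text/extension for one document, or None if no text was found."""
    # Drop blank pages and join the rest with newlines so page boundaries stay visible.
    pages = [t.strip() for t in page_texts if t.strip()]
    if not pages:
        return None  # nothing to annotate, e.g. an image-only scan
    return {
        "name": Path(pdf_path).name,  # file name without the folder prefix
        "text": "\n".join(pages),
        "extension": "txt",
    }
```

Inside the upload loop, you could call `fields = build_item_fields(filename, [page.get_text() for page in doc])` and create the item with `DatasetItem(splits=[train_split], **fields)` only when `fields` is not `None`.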