Python SDK Reference

documents

Functions

`format_pdf_date(date_str)`

Convert a PDF date string to a readable format. PDF dates typically take the form `D:YYYYMMDDHHmmSS` or `D:YYYYMMDDHHmmSS+HH'mm'`.
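As an illustration of the two date layouts above, the following is a minimal sketch of how such a string could be parsed with the standard library. The name `parse_pdf_date_sketch`, the output format, and the choice to return unrecognized input unchanged are assumptions made for this example; they are not taken from `format_pdf_date` itself.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical pattern: optional "D:" prefix, a year, then optional date/time
# fields, then an optional offset such as +HH'mm', -HH'mm' or Z.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date_sketch(date_str: str) -> str:
    """Return 'YYYY-MM-DD HH:MM:SS [+HHMM]' or the input if it does not parse."""
    match = _PDF_DATE.match(date_str.strip())
    if match is None:
        return date_str  # assumption: leave unrecognized input untouched
    year, month, day, hour, minute, second, sign, off_h, off_m = match.groups()

    tz = None
    if sign == "Z":
        tz = timezone.utc
    elif sign in ("+", "-"):
        delta = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tz = timezone(delta if sign == "+" else -delta)

    try:
        parsed = datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
            tzinfo=tz,
        )
    except ValueError:
        return date_str  # e.g. month 13 or day 32

    # %z is empty for naive datetimes, so strip the trailing space.
    return parsed.strftime("%Y-%m-%d %H:%M:%S %z").strip()


print(parse_pdf_date_sketch("D:20240131235959"))
print(parse_pdf_date_sketch("D:20240131235959+02'00'"))
```

Missing trailing fields default to the earliest valid value (month 1, day 1, time 00:00:00), which is one reasonable reading of the PDF date convention.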

`get_word_text(filename: str, metadata: dict) -> list[dict]`

Get all text from a Word file using the `docx` package. Images are not currently OCR'd.

`get_pdf_text(filename: str, metadata: dict) -> list[dict]`

Get all text from a PDF file using the `pymupdf` package. Images and image-only PDFs are not currently OCR'd.

`get_filename_metadata(filename: str, metadata_id: Union[str, int])`

Get the filename metadata.

`process_filename(filename: str, metadata_id: Union[str, int]) -> tuple[list[dict], dict]`

Get all text and metadata from a PDF or Word file.

`process_documents(folder: str, print_info: bool = True, exclude_extensions: List = [])`

`download_nltk_data(package = 'punkt')`

`basic_sentence_chunks(text: str, chunk_size: int = 3) -> List[str]`
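The signature above carries no description, so the following is only a guess at the general idea: grouping consecutive sentences into chunks of `chunk_size` sentences each. The name `sentence_chunks_sketch` and the regex-based sentence splitter are assumptions for this example; the actual function may well rely on an NLTK tokenizer instead.

```python
import re
from typing import List


def sentence_chunks_sketch(text: str, chunk_size: int = 3) -> List[str]:
    """Group sentences into chunks of `chunk_size` sentences (assumed meaning)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]


print(sentence_chunks_sketch("One. Two! Three? Four."))
```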

`sliding_window_chunks(text: str, window_size: int = 512, stride: int = 256) -> List[str]`

Split text using sliding window with overlap.
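To make the overlap concrete, here is a minimal sketch of a sliding window over whitespace-separated words. Whether `window_size` and `stride` count words, characters, or tokens is not stated in this reference; counting words is an assumption of this example, as is the name `sliding_window_sketch`.

```python
from typing import List


def sliding_window_sketch(
    text: str, window_size: int = 512, stride: int = 256
) -> List[str]:
    """Return overlapping chunks of up to `window_size` words, stepping by `stride`."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    words = text.split()
    chunks: List[str] = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + window_size]))
        # Stop once the current window already reaches the end of the text.
        if start + window_size >= len(words):
            break
    return chunks


print(sliding_window_sketch("a b c d e f", window_size=4, stride=2))
```

With a stride smaller than the window, consecutive chunks share `window_size - stride` words, so content near a chunk boundary appears in both neighbors.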

`semantic_clustering_chunks(text: str, num_clusters: int = 3) -> List[str]`

Split text using sentence embeddings and clustering.

`smart_paragraph_chunks(text: str, max_chunk_size: int = 1000) -> List[str]`

Split text into chunks based on paragraph breaks while respecting size limits.
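One way to read the description above is as greedy packing: paragraphs are appended to the current chunk until the next one would exceed the limit. The sketch below assumes that paragraphs are separated by blank lines, that `max_chunk_size` counts characters, and that a single oversized paragraph is kept whole. None of these details are confirmed by this reference, and `paragraph_chunks_sketch` is a hypothetical name.

```python
import re
from typing import List


def paragraph_chunks_sketch(text: str, max_chunk_size: int = 1000) -> List[str]:
    """Greedily pack blank-line-separated paragraphs into chunks."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Assumption: a paragraph longer than the limit is kept whole.
        current = paragraph
    if current:
        chunks.append(current)
    return chunks


print(paragraph_chunks_sketch("aaaa\n\nbbbb\n\ncccc", max_chunk_size=10))
```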