documents
Functions
format_pdf_date(date_str)
Convert a PDF date string to a readable format. PDF dates are typically in the format "D:YYYYMMDDHHmmSS" or "D:YYYYMMDDHHmmSS+HH'mm'".
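A minimal sketch of how such a date string could be parsed, using only the standard library. The name `format_pdf_date_sketch`, the output layout (`YYYY-MM-DD HH:MM:SS` plus an optional UTC offset), and the choice to return unparseable input unchanged are assumptions for illustration; the module's actual behaviour may differ.

```python
import re

# Matches "D:YYYYMMDDHHmmSS" with an optional "Z" or "+HH'mm'" / "-HH'mm'" suffix.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def format_pdf_date_sketch(date_str: str) -> str:
    """Hypothetical helper: turn a PDF date string into readable text."""
    from datetime import datetime

    match = _PDF_DATE.match(date_str.strip())
    if match is None:
        # Assumption: unknown formats are passed through untouched.
        return date_str

    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    sign, off_h, off_m = match.group(7), match.group(8), match.group(9)

    try:
        parsed = datetime(year, month, day, hour, minute, second)
    except ValueError:
        # Digits matched but do not form a valid calendar date or time.
        return date_str

    text = parsed.strftime("%Y-%m-%d %H:%M:%S")
    if sign == "Z":
        return text + " UTC"
    if sign in ("+", "-") and off_h is not None:
        return f"{text} UTC{sign}{off_h}:{off_m or '00'}"
    return text
```

For example, `format_pdf_date_sketch("D:20240131093000+02'00'")` yields `2024-01-31 09:30:00 UTC+02:00` under these assumptions.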
get_word_text(filename: str, metadata: dict) -> list[dict]
Get all text from a Word file, using the package ‘docx’. No images are currently OCR’d.
get_pdf_text(filename: str, metadata: dict) -> list[dict]
Get all text from a PDF file, using the package ‘pymupdf’. No images or image PDFs are currently OCR’d.
get_filename_metadata(filename: str, metadata_id: [str, int])
Get the filename metadata.
process_filename(filename: str, metadata_id: Union[str, int]) -> tuple[list[dict], dict]
Get all text and metadata from a PDF or Word file.
process_documents(folder: str, print_info: bool = True, exclude_extensions: List = [])
download_nltk_data(package = 'punkt')
basic_sentence_chunks(text: str, chunk_size: int = 3) -> List[str]
sliding_window_chunks(text: str, window_size: int = 512, stride: int = 256) -> List[str]
Split text using sliding window with overlap.
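The sliding-window idea can be sketched as follows. The reference does not say whether `window_size` and `stride` count characters, words, or tokens; this sketch assumes whitespace-separated words, and `sliding_window_sketch` is a hypothetical name, not the module's implementation. With these defaults, consecutive chunks overlap by `window_size - stride` words.

```python
from typing import List


def sliding_window_sketch(text: str, window_size: int = 512, stride: int = 256) -> List[str]:
    """Hypothetical helper: overlapping word windows over `text`."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")

    words = text.split()  # assumption: units are whitespace-separated words
    if not words:
        return []

    chunks: List[str] = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + window_size]))
        # Stop once a window has reached the end of the text, so the
        # tail is not repeated in ever-shorter trailing chunks.
        if start + window_size >= len(words):
            break
    return chunks
```

Choosing a stride smaller than the window gives overlap, which helps keep context that would otherwise be cut at a chunk boundary; a stride equal to the window gives non-overlapping chunks.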
semantic_clustering_chunks(text: str, num_clusters: int = 3) -> List[str]
Split text using sentence embeddings and clustering.
smart_paragraph_chunks(text: str, max_chunk_size: int = 1000) -> List[str]
Split text into chunks based on paragraph breaks while respecting size limits.
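One way to read "paragraph breaks while respecting size limits" is a greedy packer: split on blank lines, then merge neighbouring paragraphs until the next one would exceed `max_chunk_size`. The sketch below assumes the limit is measured in characters and that a single oversized paragraph is hard-split at the limit; both are assumptions, and `paragraph_chunks_sketch` is a hypothetical name.

```python
import re
from typing import List


def paragraph_chunks_sketch(text: str, max_chunk_size: int = 1000) -> List[str]:
    """Hypothetical helper: pack paragraphs into chunks of bounded length."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")

    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chunk_size:
            current = candidate
            continue

        # The paragraph does not fit: close the chunk built so far.
        if current:
            chunks.append(current)

        # Assumption: a paragraph longer than the limit is cut at the limit.
        while len(paragraph) > max_chunk_size:
            chunks.append(paragraph[:max_chunk_size])
            paragraph = paragraph[max_chunk_size:]
        current = paragraph

    if current:
        chunks.append(current)
    return chunks
```

Every chunk produced this way is at most `max_chunk_size` characters long, and paragraph boundaries are preserved wherever a paragraph fits within the limit.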