UnstructuredIO
create_element_from_text
- text (str): The text content for the element.
- element_id (Optional[str], optional): Unique identifier for the element. (default: None)
- embeddings (List[float], optional): A list of floats representing the text embeddings. (default: None)
- filename (Optional[str], optional): The name of the file the element is associated with. (default: None)
- file_directory (Optional[str], optional): The directory path where the file is located. (default: None)
- last_modified (Optional[str], optional): The last modified date of the file. (default: None)
- filetype (Optional[str], optional): The type of the file. (default: None)
- parent_id (Optional[str], optional): The identifier of the parent element. (default: None)
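A minimal sketch of how the optional parameters above might be assembled before calling create_element_from_text. The helper names element_kwargs and make_element are hypothetical, and the import path camel.loaders.UnstructuredIO is an assumption that is not stated in this section; the import is deferred so the helper can be used without the library installed.

```python
from typing import Any, Dict, List, Optional


def element_kwargs(
    text: str,
    *,
    element_id: Optional[str] = None,
    embeddings: Optional[List[float]] = None,
    filename: Optional[str] = None,
    file_directory: Optional[str] = None,
    last_modified: Optional[str] = None,
    filetype: Optional[str] = None,
    parent_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Collect the documented parameters, omitting those left at None."""
    optional = {
        "element_id": element_id,
        "embeddings": embeddings,
        "filename": filename,
        "file_directory": file_directory,
        "last_modified": last_modified,
        "filetype": filetype,
        "parent_id": parent_id,
    }
    return {"text": text, **{k: v for k, v in optional.items() if v is not None}}


def make_element(**kwargs: Any) -> Any:
    """Pass the collected arguments to the loader (requires the library)."""
    # Assumption: the class is importable from camel.loaders.
    from camel.loaders import UnstructuredIO

    return UnstructuredIO().create_element_from_text(**kwargs)


kwargs = element_kwargs("Hello, world.", filename="notes.txt", filetype="txt")
```

Only the arguments that were actually set end up in kwargs, so every omitted parameter falls back to its documented default of None.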
parse_file_or_url
- input_path (str): Path to the file or URL to be parsed.
- **kwargs: Extra kwargs passed to the partition function.
parse_bytes
- file (IO[bytes]): The file in bytes format to be parsed.
- **kwargs: Extra kwargs passed to the partition function.
Note:
Supported file types:
“csv”, “doc”, “docx”, “epub”, “image”, “md”, “msg”, “odt”,
“org”, “pdf”, “ppt”, “pptx”, “rtf”, “rst”, “tsv”, “xlsx”.
References:
https://docs.unstructured.io/open-source/core-functionality/partitioning
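As a rough illustration of the note above, an input can be checked against the listed file types before it is handed to parse_file_or_url. This is a sketch, not part of the library: the helper names are hypothetical, and the set of extensions treated as "image" is an assumption, since the note does not say which image formats are covered.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse

# File types listed in the note above.
SUPPORTED_TYPES = {
    "csv", "doc", "docx", "epub", "image", "md", "msg", "odt",
    "org", "pdf", "ppt", "pptx", "rtf", "rst", "tsv", "xlsx",
}

# Assumption: the note does not define which extensions count as "image".
IMAGE_EXTENSIONS = {"png", "jpg", "jpeg"}


def is_url(input_path: str) -> bool:
    """Return True when the input looks like an http(s) URL."""
    parsed = urlparse(input_path)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def guess_supported_type(input_path: str) -> Optional[str]:
    """Map a local path to one of the supported types, or None."""
    ext = Path(input_path).suffix.lower().lstrip(".")
    if ext in IMAGE_EXTENSIONS:
        return "image"
    return ext if ext in SUPPORTED_TYPES else None
```

A caller could use guess_supported_type to fail early with a clear message instead of waiting for the partition function to reject the file.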
clean_text_data
Cleans text data using the unstructured library.
This function applies multiple text cleaning utilities by calling the
unstructured library's cleaning bricks for operations such as
replacing Unicode quotes and removing extra whitespace, dashes, non-ASCII
characters, and more.
If no cleaning options are provided, a default set of cleaning
operations is applied: “replace_unicode_quotes”, “clean_non_ascii_chars”,
“group_broken_paragraphs”, and “clean_extra_whitespace”.
Parameters:
- text (str): The text to be cleaned.
- clean_options (dict): A dictionary specifying which cleaning options to apply. Each key must be the name of a cleaning function (brick) from the unstructured library, and each value must be a nested dictionary containing the parameters for that function. Supported types: ‘clean_extra_whitespace’, ‘clean_bullets’, ‘clean_ordered_bullets’, ‘clean_postfix’, ‘clean_prefix’, ‘clean_dashes’, ‘clean_trailing_punctuation’, ‘clean_non_ascii_chars’, ‘group_broken_paragraphs’, ‘remove_punctuation’, ‘replace_unicode_quotes’, ‘bytes_string_to_string’, ‘translate_text’.
References:
https://unstructured-io.github.io/unstructured/
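The shape of clean_options described above (function name as key, nested parameter dictionary as value) can be sketched with a small dispatcher. The three cleaning functions here are simplified stand-ins written for illustration, not the unstructured implementations, and the default set is trimmed to the two operations the sketch implements.

```python
import re
from typing import Any, Callable, Dict, Optional


# Stand-in cleaners (hypothetical simplifications, NOT the library's code).
def clean_extra_whitespace(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


def replace_unicode_quotes(text: str) -> str:
    table = str.maketrans(
        {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}
    )
    return text.translate(table)


def clean_prefix(text: str, pattern: str) -> str:
    return re.sub(rf"^{pattern}", "", text).lstrip()


CLEANERS: Dict[str, Callable[..., str]] = {
    "clean_extra_whitespace": clean_extra_whitespace,
    "replace_unicode_quotes": replace_unicode_quotes,
    "clean_prefix": clean_prefix,
}

# Assumption: a reduced default set, limited to the stand-ins defined above.
DEFAULT_OPTIONS: Dict[str, Dict[str, Any]] = {
    "replace_unicode_quotes": {},
    "clean_extra_whitespace": {},
}


def clean_text_sketch(
    text: str, clean_options: Optional[Dict[str, Dict[str, Any]]] = None
) -> str:
    """Apply each named cleaner in order, passing its nested parameters."""
    options = DEFAULT_OPTIONS if clean_options is None else clean_options
    for name, params in options.items():
        if name not in CLEANERS:
            raise ValueError(f"Unsupported cleaning option: {name!r}")
        text = CLEANERS[name](text, **params)
    return text


default_result = clean_text_sketch("  \u201chello\u201d   world ")
prefix_result = clean_text_sketch(
    "SUMMARY: done", {"clean_prefix": {"pattern": "SUMMARY:"}}
)
```

Functions that take no extra arguments get an empty dictionary as their value, while a function such as clean_prefix receives its pattern through the nested dictionary.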
extract_data_from_text
- text (str): Text to extract data from.
- extract_type (Literal[‘extract_datetimetz’, ‘extract_email_address’, ‘extract_ip_address’, ‘extract_ip_address_name’, ‘extract_mapi_id’, ‘extract_ordered_bullets’, ‘extract_text_after’, ‘extract_text_before’, ‘extract_us_phone_number’]): Type of data to extract.
- **kwargs: Additional keyword arguments for specific extraction functions.
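To illustrate how extract_type selects an extraction function and how **kwargs reaches it, here is a small stand-alone sketch. The two extractors are simplified regex-based stand-ins written for this example, not the unstructured functions of the same name, and the dispatcher name is hypothetical.

```python
import re
from typing import Any, Callable, Dict


# Stand-in extractors (hypothetical simplifications, NOT the library's code).
def extract_email_address(text: str) -> list:
    return re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)


def extract_text_after(text: str, pattern: str) -> str:
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"Pattern {pattern!r} not found in text")
    return text[match.end():].strip()


EXTRACTORS: Dict[str, Callable[..., Any]] = {
    "extract_email_address": extract_email_address,
    "extract_text_after": extract_text_after,
}


def extract_data_sketch(text: str, extract_type: str, **kwargs: Any) -> Any:
    """Look up the extractor by name and forward the extra arguments."""
    if extract_type not in EXTRACTORS:
        raise ValueError(f"Unsupported extract_type: {extract_type!r}")
    return EXTRACTORS[extract_type](text, **kwargs)


emails = extract_data_sketch(
    "Contact jane@example.com or bob@test.org.", "extract_email_address"
)
after = extract_data_sketch(
    "SPEAKER 1: Look at me", "extract_text_after", pattern=r"SPEAKER \d:"
)
```

Extraction types that need no extra input are called with the text alone; types such as extract_text_after take their pattern through the keyword arguments.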
stage_elements
- elements (List[Any]): List of Element objects to be staged.
- stage_type (Literal[‘convert_to_csv’, ‘convert_to_dataframe’, ‘convert_to_dict’, ‘dict_to_elements’, ‘stage_csv_for_prodigy’, ‘stage_for_prodigy’, ‘stage_for_baseplate’, ‘stage_for_datasaur’, ‘stage_for_label_box’, ‘stage_for_label_studio’, ‘stage_for_weaviate’]): Type of staging to perform.
- **kwargs: Additional keyword arguments specific to the staging type.
chunk_elements
- elements (List[Element]): List of Element objects to be chunked.
- chunk_type (str): Type of chunking to apply. Supported types: ‘chunk_by_title’.
- **kwargs: Additional keyword arguments for chunking.
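The idea behind title-based chunking can be sketched in plain Python: a new chunk starts whenever a title element appears or the current chunk would grow past a size limit. This is a simplified illustration under stated assumptions, not the library's algorithm; SimpleElement, chunk_by_title_sketch, and the max_characters limit are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleElement:
    """Hypothetical stand-in for a parsed element."""
    category: str  # e.g. "Title" or "NarrativeText"
    text: str


def chunk_by_title_sketch(
    elements: List[SimpleElement], max_characters: int = 500
) -> List[List[SimpleElement]]:
    """Group elements into chunks that each begin at a title."""
    chunks: List[List[SimpleElement]] = []
    current: List[SimpleElement] = []
    size = 0
    for element in elements:
        starts_section = element.category == "Title"
        too_long = size + len(element.text) > max_characters
        # Close the running chunk at a section boundary or size limit.
        if current and (starts_section or too_long):
            chunks.append(current)
            current, size = [], 0
        current.append(element)
        size += len(element.text)
    if current:
        chunks.append(current)
    return chunks


doc = [
    SimpleElement("Title", "Intro"),
    SimpleElement("NarrativeText", "First paragraph."),
    SimpleElement("NarrativeText", "Second paragraph."),
    SimpleElement("Title", "Methods"),
    SimpleElement("NarrativeText", "Third paragraph."),
]
chunks = chunk_by_title_sketch(doc)
```

With the sample input, the two titles split the document into two chunks, each keeping a heading together with the text that follows it.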