Using Documents in Structify

A wealth of unstructured data often sits inside documents that follow no set format. Structify lets you upload these documents and extract the data you need from them.

Uploading Documents

You can upload documents to Structify with client.documents.upload by passing the file contents read in binary mode. You also need to specify the path where the document should be stored on your Structify account, along with the document type.

from structify import Structify
client = Structify()

file = "/path/to/your/document.pdf"

# Read the local file in binary mode; the with block closes it after the upload
with open(file, "rb") as f:
    client.documents.upload(
        content=f.read(),
        path=b"/path/on/your/Structify/remote.pdf",
        file_type="PDF",
    )
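When several documents need uploading, the same call can be wrapped in a loop. The following is a minimal sketch, not part of the Structify SDK: upload_pdfs is a hypothetical helper, and it assumes only the upload call shown above (content, path as bytes, and file_type="PDF").

```python
from pathlib import Path


def upload_pdfs(client, local_dir, remote_dir):
    """Upload every .pdf file in local_dir and return the remote paths used.

    `client` is assumed to expose documents.upload(content=..., path=...,
    file_type=...) as in the example above. This helper is hypothetical.
    """
    uploaded = []
    for pdf in sorted(Path(local_dir).glob("*.pdf")):
        remote_path = f"{remote_dir.rstrip('/')}/{pdf.name}"
        # Read in binary mode; the context manager closes the file handle.
        with pdf.open("rb") as f:
            client.documents.upload(
                content=f.read(),
                path=remote_path.encode(),  # bytes, mirroring the b"..." path above
                file_type="PDF",
            )
        uploaded.append(remote_path)
    return uploaded
```

Calling upload_pdfs(client, "/path/to/your/documents", "/path/on/your/Structify") would upload each PDF under the given remote folder and return the list of remote paths, which can be reused later when downloading or structuring.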

Currently, we support the following document formats:

PDFs
Text files (TXT, CSV, etc.)
Images (JPG, PNG, etc.)

We are working to support more formats in the future, such as:

Word documents (DOCX)
Excel spreadsheets (XLSX)
PowerPoint presentations (PPTX)

In the meantime, we recommend converting all your documents to either PDFs or images before uploading them to Structify. With the DocumentImage structuring endpoint (read more about it in the Structuring Data section), you can extract data from any document type after converting it to an image.
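Before uploading a mixed folder, it can help to sort files into those that can go up as-is and those that need converting first. The sketch below is illustrative only: the extension sets are copied from the lists above (which end in "etc.", so they are incomplete), and upload_readiness is a hypothetical helper, not a Structify function.

```python
from pathlib import Path

# Extensions taken from the lists above. These sets are illustrative and
# incomplete; they are an assumption, not an official specification.
SUPPORTED = {".pdf", ".txt", ".csv", ".jpg", ".png"}
PLANNED = {".docx", ".xlsx", ".pptx"}


def upload_readiness(filename):
    """Classify a file as 'upload', 'convert', or 'unknown' by its extension."""
    ext = Path(filename).suffix.lower()
    if ext in SUPPORTED:
        return "upload"
    if ext in PLANNED:
        return "convert"  # convert to a PDF or an image before uploading
    return "unknown"
```

For example, upload_readiness("slides.pptx") returns "convert", signalling that the file should be exported to a PDF or an image first.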

Once you’ve uploaded them, you can use our other document endpoints to list, download, and delete the documents.

Here are examples of how you would use those endpoints:

# Listing all documents will return a JSON object of all your uploaded documents
client.documents.list()

# Download a document by specifying its path on your Structify account. This returns the document in binary form, which you can save to your local machine.
document = client.documents.download(file_path='/path/on/your/Structify/remote.pdf')
with open("downloaded_document.pdf", "wb") as f:
    f.write(document.read())

# Delete a document by specifying its path. This removes the document from your Structify account.
client.documents.delete(file_path='/path/on/your/Structify/remote.pdf')
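The download step can be wrapped in a small function that also creates the local folder. This is a sketch under one assumption taken from the example above: that client.documents.download(file_path=...) returns an object with a read() method yielding bytes. The name download_to is hypothetical.

```python
from pathlib import Path


def download_to(client, remote_path, local_path):
    """Download remote_path, write it to local_path, and return the byte count."""
    # Assumes the returned object has read(), as in the example above.
    document = client.documents.download(file_path=remote_path)
    data = document.read()
    local = Path(local_path)
    # Create any missing parent folders before writing.
    local.parent.mkdir(parents=True, exist_ok=True)
    local.write_bytes(data)
    return len(data)
```

A call such as download_to(client, "/path/on/your/Structify/remote.pdf", "downloads/remote.pdf") would save the document locally and report how many bytes were written.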

Extracting Data from Documents

Creating datasets from documents is straightforward. Call the client.structure.run_async method and specify the document path or paths you want to include in the dataset by wrapping each one in the relevant source object (SourcePdf for PDFs).

from structify.types.structure_run_async_params import SourcePdf, SourcePdfPdf

client.structure.run_async(
    dataset="startups",
    source=SourcePdf(pdf=SourcePdfPdf(path="/path/to/your/document.pdf")),
)

And just like that, you’ve created a dataset from your documents.
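To include several documents, one option is to issue one run_async call per path. The sketch below assumes only the call signature shown above (dataset and source keyword arguments). structure_documents and its build_source parameter are hypothetical names, and nothing is assumed about what run_async returns beyond collecting it.

```python
def structure_documents(client, dataset, paths, build_source):
    """Start one run_async call per path and return (path, result) pairs.

    `build_source` turns a path into the `source` argument, so the helper
    itself does not depend on any particular source type.
    """
    results = []
    for path in paths:
        # Same call as in the example above, repeated for each document.
        result = client.structure.run_async(
            dataset=dataset,
            source=build_source(path),
        )
        results.append((path, result))
    return results
```

With the imports from the example above, passing build_source=lambda p: SourcePdf(pdf=SourcePdfPdf(path=p)) reproduces the single-document call for every path in the list.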