Structuring Data
The main functionality of the Structify API lies in structuring information from a variety of sources on demand. Our structuring endpoints allow you to direct our research agents to collect data from the web, PDFs, and more to populate your datasets with structured information.
Populating Your Datasets
Once you have blueprinted your dataset by creating a schema, you can use Structify’s AI agents to collect and structure data.
You can run our scraper agents through either structify.structure.run or structify.structure.run_async to populate a dataset with an initial batch of data. The structure API call takes the following arguments:
dataset: (required) The name of the dataset you want to populate
source: (required) The source you want the agent to extract data from. More on this in Sources
extraction_criteria: (optional) The criteria that must be met for the agent to extract data from the source. More on this in Extraction Criteria
Here’s an example of an API call that populates an employees dataset with data from LinkedIn using structify.structure.run:
from structify import Structify
from structify.sources import Web

# `structify` is assumed to be an initialized Structify client
structify.structure.run(
    dataset="employees",
    source=Web(starting_website="linkedin.com")
)
Note
The output of structify.structure.run is a view of the extracted entities in the dataset after the run completes.
If you want to run the populate request asynchronously, you can use structify.structure.run_async:
job_id = structify.structure.run_async(
    dataset="employees",
    source=Web(starting_website="linkedin.com")
)
structify.structure.job_status(job=[job_id])
Note
The output of structify.structure.run_async is a job ID that you can use to access the run and check its status via structify.structure.job_status.
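Because an asynchronous run only hands back a job ID, callers usually need some way to wait for the job to finish. The following is a minimal polling sketch in plain Python, not part of the Structify SDK. The helper name wait_for_job, the get_status callable, and the status strings "Completed" and "Failed" are all assumptions made for illustration; the actual return shape of structify.structure.job_status is not described here.

```python
import time
from typing import Callable, Iterable


def wait_for_job(
    get_status: Callable[[str], str],
    job_id: str,
    done_states: Iterable[str] = ("Completed", "Failed"),  # assumed status names
    poll_seconds: float = 5.0,
    max_polls: int = 120,
) -> str:
    """Call get_status(job_id) repeatedly until it returns a terminal state."""
    done = set(done_states)
    for _ in range(max_polls):
        status = get_status(job_id)
        if status in done:
            return status
        # Not finished yet: wait before asking again.
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

In practice, get_status would be a small wrapper that calls structify.structure.job_status(job=[job_id]) and maps the response to a status string. Bounding the number of polls keeps a stuck job from blocking the caller indefinitely.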
Extraction Criteria
Extraction criteria specify what you want the agent to extract from the source. They tell our agents which entities, properties, or relationships must be present before data is extracted to populate your dataset. If you do not specify any, the default value is an empty list, meaning the agent will extract any data from the provided source that is present in the schema. There are three types of extraction criteria that you can specify:
Required Entity: If you want data about a specific entity, you can specify that entity as required. This criterion requires you to pass the entity itself to the run or run_async call as the starting_entity, as follows:
from structify.extraction_criteria import RequiredEntity

structify.structure.run(
    dataset="employees",
    source=Web(starting_website="linkedin.com"),
    extraction_criteria=[RequiredEntity(id=0)],
    starting_entity={
        "id": 0,
        "type": "employee",
        "properties": {
            "name": "Jane Doe"
        }
    }
)
Note
The ID you specify in the extraction criteria must match the id of the starting_entity.
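An id mismatch is easy to introduce when the starting entity is built somewhere else in your code. A small pre-flight check, sketched here in plain Python, can catch it before a run is started. The helper name check_required_entity_ids is hypothetical and not part of the Structify SDK; it only compares plain values.

```python
from typing import Iterable, Mapping


def check_required_entity_ids(
    required_ids: Iterable[int],
    starting_entity: Mapping[str, object],
) -> None:
    """Raise ValueError if any required id differs from the starting entity's id."""
    start_id = starting_entity["id"]
    mismatched = [rid for rid in required_ids if rid != start_id]
    if mismatched:
        raise ValueError(
            f"required entity ids {mismatched} do not match "
            f"starting_entity id {start_id!r}"
        )


# Values mirror the example above: both ids are 0, so the check passes silently.
starting_entity = {"id": 0, "type": "employee", "properties": {"name": "Jane Doe"}}
check_required_entity_ids([0], starting_entity)
```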
Required Property: If you want to require that certain properties be present for a table before data is extracted, you can use the required property extraction criterion.
from structify.extraction_criteria import RequiredProperty

structify.structure.run(
    dataset="employees",
    source=Web(starting_website="linkedin.com"),
    extraction_criteria=[RequiredProperty(
        table="job",
        properties=["title", "company"]
    )]
)
Note
The agent will extract data if at least one of the specified properties is present.
Required Relationship: If you want to require that a certain relationship be present before data is extracted, you can use the required relationship extraction criterion.
from structify.extraction_criteria import RequiredRelationship

structify.structure.run(
    dataset="employees",
    source=Web(starting_website="linkedin.com"),
    extraction_criteria=[RequiredRelationship(
        relationship_name="worked"
    )]
)
You can pass multiple extraction criteria to ensure that a set of conditions is met before data is saved.
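To make the semantics described in this section concrete, here is a conceptual model in plain Python. It is not how Structify evaluates criteria internally. It assumes that a required-property criterion passes when at least one listed property is present, that multiple criteria must all pass, and that an empty criteria list accepts any record. The names requires_any_property and meets_all are hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Criterion = Callable[[Record], bool]


def requires_any_property(properties: List[str]) -> Criterion:
    """Build a criterion that passes when at least one property has a value."""
    def check(record: Record) -> bool:
        return any(record.get(p) not in (None, "") for p in properties)
    return check


def meets_all(record: Record, criteria: List[Criterion]) -> bool:
    """Every criterion must pass; an empty list accepts any record."""
    return all(criterion(record) for criterion in criteria)


job_criteria = [requires_any_property(["title", "company"])]
print(meets_all({"title": "CEO"}, job_criteria))       # one listed property is present
print(meets_all({"name": "Jane Doe"}, job_criteria))   # neither listed property is present
```

Modeling each criterion as a predicate keeps the combination rule in one place, so changing from "all must pass" to another rule would only touch meets_all.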
Sources
You can populate your datasets from a variety of sources, such as:
Web: Our AI agents can navigate the Web and scrape data at scale. This is our bread and butter.
PDF: We can also extract unstructured data from PDFs.
Text: If you have plain text you want to structure, you can use this source.
SEC Filings: We also have a direct integration to the SEC if you want to extract data from their filings.
DocumentImage: We support all other document types through this source, provided you first convert the documents into images.
Below are some examples of how you can start structuring runs on each source:
Web
from structify.sources import Web

structify.structure.run_async(
    dataset="employees",
    source=Web(starting_website="linkedin.com")
)
PDF
from structify.sources import PDF

structify.structure.run_async(
    dataset="employees",
    source=PDF(path="path/to/pdf")
)
Note
The path to the PDF is the remote path of the document uploaded to Structify. For more information on how to upload documents, see the Using Documents in Structify section, or check out the tutorial at Step 1: Upload the Relevant Documents.
Text
For text, you can either input the text directly or use a path to a text file uploaded to Structify.
from structify.sources import Text

# Option 1: pass the text directly
structify.structure.run_async(
    dataset="employees",
    source=Text(content="Jane Doe is the CEO of ACME. Previously she was the Senior VP at EMCA.")
)

# Option 2: point to a text file uploaded to Structify
structify.structure.run_async(
    dataset="employees",
    source=Text(path="path/to/text")
)
SEC Filings
from structify.sources import SECFiling

structify.structure.run_async(
    dataset="employees",
    source=SECFiling(
        year=2021,  # Optional
        quarter=3,  # Optional
        accession_number="0000320193-21-000056"  # Optional
    )
)
DocumentImage
from structify.sources import DocumentImage

structify.structure.run_async(
    dataset="employees",
    source=DocumentImage(path="path/to/image")
)
Viewing Your Datasets
The structify.dataset.view endpoint lets you view either all the entities or all the relationships in your dataset.
entities = structify.dataset.view(
    name="employees",
    requested_type="Entities"  # The default value is "Entities", but we show it here for clarity
)
relationships = structify.dataset.view(
    name="employees",
    requested_type="Relationships"
)
Each call returns an iterator, which you can use to view the dataset as follows:
for entity in entities:
    print(entity)
for relationship in relationships:
    print(relationship)
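Since the view output is described as an iterator, it may be consumed as you loop over it. The sketch below uses a stand-in iterator of plain dictionaries to show how to peek at the first few items without reading everything. It assumes single-pass iterator behavior; the actual object returned by the view call may behave differently, and the helper name preview is hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def preview(items: Iterable[T], limit: int = 5) -> List[T]:
    """Return up to `limit` items without reading the rest of the iterator."""
    return list(islice(items, limit))


# Stand-in for the iterator returned by a view call (illustrative data only).
entities: Iterator[dict] = iter(
    [{"name": "Jane Doe"}, {"name": "John Roe"}, {"name": "Ann Poe"}]
)

first_two = preview(entities, limit=2)
remaining = list(entities)  # only the items that have not been consumed yet
```

If you need to loop over the results more than once, collecting them into a list first is the simplest option, at the cost of holding the whole result in memory.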
Tip
To view a particular type of entity or relationship, you can add the table_name or relationship_name parameter to the respective view call.
Note
Keep an eye out for the structify.datasets.refresh API call to update the data in your dataset.