Discover API

Autogenerated reference for the discovery module.

`discover`

Public discovery helpers for multimodal feature schema generation.

The functions in this module accept raw inputs or folders on disk, delegate the actual reasoning to a provider, and persist the discovered schema as JSON in an output directory. Discovery is intentionally folder-oriented so the same API can be used from notebooks, scripts, and batch pipelines.
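
The flow described above (accept inputs, delegate to a provider, persist JSON to an output directory) can be illustrated with a small stand-in. This is a minimal sketch of the pattern, not the module's code: `EchoProvider`, `discover_and_persist`, and the payload shape are hypothetical names chosen for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


class EchoProvider:
    """Hypothetical stand-in provider; a real provider would call a model."""

    def text_features(self, texts: List[str], prompt: str) -> Dict[str, Any]:
        # Pretend every distinct lowercase word is a discovered feature.
        words = sorted({w.lower() for t in texts for w in t.split()})
        return {"prompt": prompt, "features": words}


def discover_and_persist(
    texts: List[str],
    provider: EchoProvider,
    output_dir: Path,
    output_filename: str = "discovered_text_features.json",
) -> Dict[str, Any]:
    """Delegate to the provider, then write the result as JSON (assumed flow)."""
    result = provider.text_features(texts, prompt="List salient features.")
    output_dir.mkdir(parents=True, exist_ok=True)
    artifact = output_dir / output_filename
    artifact.write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "outputs"
    payload = discover_and_persist(["Red apple", "green apple"], EchoProvider(), out)
    # Reading the artifact back gives the same structure the caller received.
    saved = json.loads((out / "discovered_text_features.json").read_text(encoding="utf-8"))
```

Because the artifact lands in a plain directory, a notebook, a script, or a batch job can all pick it up later with nothing more than `json.loads`.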

Functions:

discover_features_from_images – Discover features from image files and persist the provider response.
discover_features_from_tabular – Discover features from tabular datasets by projecting a text column.
discover_features_from_texts – Discover features from text strings, files, or folders of documents.
discover_features_from_videos – Discover features from one or more videos.

`discover_features_from_images(image_paths_or_folder: str | List[str], prompt: str = image_discovery_prompt, provider: Optional[OpenAIProvider] = None, as_set: bool = True, output_dir: str | Path = 'outputs', output_filename: Optional[str] = None, min_features: int = 10) -> DiscoveryResult`

Discover features from image files and persist the provider response.

Parameters:

image_paths_or_folder
(str | List[str]) –

A single image path, a folder containing images, or a list of image file paths.
prompt
(str, default: image_discovery_prompt ) –

System-style prompt passed through to the provider.
provider
(Optional[OpenAIProvider], default: None ) –

Optional provider instance. When omitted, an OpenAIProvider is created from environment variables.
as_set
(bool, default: True ) –

When True, all images are analyzed together and a single shared feature schema is produced. When False, each image is sent independently and the result contains one entry per image.
output_dir
(str | Path, default: 'outputs' ) –

Directory where the JSON artifact should be written.
output_filename
(Optional[str], default: None ) –

Custom filename for the saved artifact. Defaults to discovered_image_features.json.
min_features
(int, default: 10 ) –

Minimum number of features to request from the provider.

Returns:

DiscoveryResult –

A single discovery payload in joint mode or a list of payloads in per-image mode. The on-disk JSON always preserves the raw provider result list.
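
Since the return value is either one payload or a list of payloads depending on `as_set`, calling code may want to normalize it before iterating. The helper below is a hedged sketch: `as_payload_list` is a hypothetical name, and it assumes each payload is a plain dictionary, which this reference does not confirm.

```python
from typing import Any, Dict, List, Union

# Assumption: a payload is a JSON-like dictionary.
Payload = Dict[str, Any]


def as_payload_list(result: Union[Payload, List[Payload]]) -> List[Payload]:
    """Wrap a joint-mode payload in a list; pass per-item lists through."""
    return result if isinstance(result, list) else [result]


joint = as_payload_list({"features": ["color"]})
per_item = as_payload_list([{"features": ["color"]}, {"features": ["shape"]}])
```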

Raises:

FileNotFoundError –

If the provided path does not exist.
ValueError –

If no supported image files are found.
RuntimeError –

If image decoding fails for every candidate input.
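
The input handling and the first two error cases above can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: `resolve_image_paths` and the `IMAGE_EXTENSIONS` set are hypothetical, and the sketch does not decode images, so the `RuntimeError` case is not covered.

```python
import tempfile
from pathlib import Path
from typing import List, Union

# Assumed extension set; the module's actual list of supported formats may differ.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def resolve_image_paths(image_paths_or_folder: Union[str, List[str]]) -> List[Path]:
    """Expand a file, a folder, or a list of files into a list of image paths."""
    if isinstance(image_paths_or_folder, str):
        root = Path(image_paths_or_folder)
        if not root.exists():
            raise FileNotFoundError(f"Path does not exist: {root}")
        candidates = sorted(root.iterdir()) if root.is_dir() else [root]
    else:
        candidates = [Path(p) for p in image_paths_or_folder]
        for path in candidates:
            if not path.exists():
                raise FileNotFoundError(f"Path does not exist: {path}")
    images = [p for p in candidates if p.suffix.lower() in IMAGE_EXTENSIONS]
    if not images:
        raise ValueError("No supported image files found.")
    return images


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a.png").write_bytes(b"")
    (folder / "notes.txt").write_text("not an image")
    found_names = [p.name for p in resolve_image_paths(str(folder))]
    try:
        resolve_image_paths(str(folder / "missing"))
        missing_error = None
    except FileNotFoundError as exc:
        missing_error = type(exc).__name__
```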

`discover_features_from_tabular(file_or_folder: str | Path, text_column: str, provider: Optional[OpenAIProvider] = None, prompt: str = text_discovery_prompt, as_set: bool = True, output_dir: str | Path = 'outputs', output_filename: Optional[str] = None, max_rows: Optional[int] = None, min_features: int = 10) -> DiscoveryResult`

Discover features from tabular datasets by projecting a text column.

Supported files are loaded into a single DataFrame, the selected text column is extracted, and the resulting list of strings is delegated to discover_features_from_texts.

Parameters:

file_or_folder
(str | Path) –

A single tabular file or a directory containing supported tabular files.
text_column
(str) –

Column name whose values should be used as textual input for discovery.
provider
(Optional[OpenAIProvider], default: None ) –

Optional provider instance.
prompt
(str, default: text_discovery_prompt ) –

Prompt passed through to the provider.
as_set
(bool, default: True ) –

Whether to discover one shared schema across all sampled rows or process rows independently.
output_dir
(str | Path, default: 'outputs' ) –

Directory where the JSON artifact should be written.
output_filename
(Optional[str], default: None ) –

Custom filename for the saved artifact. Defaults to discovered_tabular_features.json.
max_rows
(Optional[int], default: None ) –

Optional cap on how many rows are used from the concatenated dataset.
min_features
(int, default: 10 ) –

Minimum number of distinct features to request from the provider.

Returns:

DiscoveryResult –

The same return shape as discover_features_from_texts.

Raises:

FileNotFoundError –

If the provided path does not exist.
ValueError –

If no supported tabular files are found or text_column is missing.
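
The projection step (load every file, concatenate, pull out one column, cap the row count) can be sketched without third-party dependencies. Note the swap: the description above mentions a DataFrame, while this sketch uses the standard library's `csv.DictReader` on in-memory CSV text to stay self-contained. `project_text_column` is a hypothetical helper, and taking the first `max_rows` rows is an assumption; the actual function may sample rows differently.

```python
import csv
import io
from typing import Iterable, List, Optional


def project_text_column(
    csv_sources: Iterable[str],
    text_column: str,
    max_rows: Optional[int] = None,
) -> List[str]:
    """Concatenate rows from several CSV documents and return one column as strings."""
    texts: List[str] = []
    for source in csv_sources:
        reader = csv.DictReader(io.StringIO(source))
        if text_column not in (reader.fieldnames or []):
            raise ValueError(f"Column {text_column!r} is missing.")
        texts.extend(row[text_column] for row in reader)
    # Apply the cap after concatenation, as the max_rows description suggests.
    return texts if max_rows is None else texts[:max_rows]


reviews_a = "id,review\n1,Great battery\n2,Weak screen\n"
reviews_b = "id,review\n3,Fast shipping\n"
all_texts = project_text_column([reviews_a, reviews_b], "review")
capped_texts = project_text_column([reviews_a, reviews_b], "review", max_rows=2)
```

The resulting list of strings is the kind of input that would then be handed to `discover_features_from_texts`.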

`discover_features_from_texts(texts_or_file: str | List[str], prompt: str = text_discovery_prompt, provider: Optional[OpenAIProvider] = None, as_set: bool = True, output_dir: str | Path = 'outputs', output_filename: Optional[str] = None, num_classes: Optional[int] = None, min_features: int = 10) -> DiscoveryResult`

Discover features from text strings, files, or folders of documents.

Parameters:

texts_or_file
(str | List[str]) –

Either a raw text string, a list of raw text strings, a single supported document path, or a directory containing supported text documents. String inputs are treated as paths only when they already exist on disk or look path-like, such as notes/file.txt.
prompt
(str, default: text_discovery_prompt ) –

Prompt passed through to the provider.
provider
(Optional[OpenAIProvider], default: None ) –

Optional provider instance. Defaults to OpenAIProvider.
as_set
(bool, default: True ) –

When True, all extracted text is combined into a single request so the provider can discover a shared schema. When False, each text chunk is processed independently.
output_dir
(str | Path, default: 'outputs' ) –

Directory where the JSON artifact should be written.
output_filename
(Optional[str], default: None ) –

Custom filename for the saved artifact. Defaults to discovered_text_features.json.
num_classes
(Optional[int], default: None ) –

Optional number of hidden classes reflected in the prompt.
min_features
(int, default: 10 ) –

Minimum number of distinct features to request from the provider.

Returns:

DiscoveryResult –

A single discovery payload in joint mode or a list of payloads in per-text mode.

Raises:

FileNotFoundError –

If a path-like input does not exist.
ValueError –

If the path is invalid or no supported text input can be extracted.
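
The rule for `texts_or_file` (a string counts as a path only if it exists on disk or looks path-like) could be approximated as shown below. The exact heuristic is not documented here, so this is only one plausible reading: `looks_like_path` and the `DOCUMENT_SUFFIXES` set are hypothetical, and the sketch treats any string containing whitespace as raw text unless it already exists on disk.

```python
import os
from pathlib import Path

# Assumed suffix list; the module's supported document types may differ.
DOCUMENT_SUFFIXES = {".txt", ".md", ".pdf"}


def looks_like_path(value: str) -> bool:
    """Guess whether a string is meant as a path rather than raw text."""
    if os.path.exists(value):
        return True
    if any(ch.isspace() for ch in value):
        # Sentences and multi-line text are treated as raw input.
        return False
    return Path(value).suffix.lower() in DOCUMENT_SUFFIXES


checks = {
    sample: looks_like_path(sample)
    for sample in ["notes/file.txt", "The battery lasts all day."]
}
```

Under a rule like this, `notes/file.txt` is classified as a path even when the file is absent, which is the situation the `FileNotFoundError` above describes.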

`discover_features_from_videos(videos_or_folder: str | List[str], prompt: str = image_discovery_prompt, provider: Optional[OpenAIProvider] = None, as_set: bool = True, num_frames: int = 5, output_dir: str | Path = 'outputs', output_filename: Optional[str] = None, use_audio: bool = True, max_videos_to_sample: int = 5, max_total_frames_payload: int = 15, random_seed: Optional[int] = None, min_features: int = 10) -> DiscoveryResult`

Discover features from one or more videos.

Each video is converted into representative frames and, optionally, an audio transcript. The resulting multimodal payload is sent to the provider and the raw response is written to JSON.

Parameters:

videos_or_folder
(str | List[str]) –

A single video path, a folder containing videos, or a list of video file paths.
prompt
(str, default: image_discovery_prompt ) –

Prompt passed through to the provider.
provider
(Optional[OpenAIProvider], default: None ) –

Optional provider instance implementing image_features and, when use_audio=True, optionally transcribe_audio.
as_set
(bool, default: True ) –

When True, all extracted frames are analyzed together to produce one shared schema. When False, all extracted frames are pooled together and analyzed individually, so the returned list has one entry per extracted frame rather than one entry per source video.
num_frames
(int, default: 5 ) –

Target number of key frames to extract per video before downsampling across the batch.
output_dir
(str | Path, default: 'outputs' ) –

Directory where the JSON artifact should be written.
output_filename
(Optional[str], default: None ) –

Custom filename for the saved artifact. Defaults to discovered_video_features.json.
use_audio
(bool, default: True ) –

Whether to extract an audio track and include a transcript as extra context when the provider supports transcription.
max_videos_to_sample
(int, default: 5 ) –

Upper bound on how many videos are sampled from a folder input to control cost and payload size. When a folder contains more than this many videos, a subset is sampled before frame extraction.
max_total_frames_payload
(int, default: 15 ) –

Upper bound on the total number of frames sent to the provider across the batch.
random_seed
(Optional[int], default: None ) –

Optional seed used when folder inputs need to sample a subset of videos. Pass a value here to make the sampled subset reproducible across runs.
min_features
(int, default: 10 ) –

Minimum number of features to request from the provider.

Returns:

DiscoveryResult –

A single discovery payload in joint mode or a list of payloads in pooled per-frame mode.

Raises:

FileNotFoundError –

If the input path is missing or a folder contains no supported video files.
ValueError –

If no frames can be extracted from the provided videos.
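
The interaction between `max_videos_to_sample`, `random_seed`, `num_frames`, and `max_total_frames_payload` can be pictured with a small simulation. This is a sketch under assumptions, not the module's algorithm: `sample_videos` and `cap_frames` are hypothetical helpers, frames are represented as strings, and even thinning is just one possible downsampling strategy.

```python
import random
from typing import List, Optional


def sample_videos(
    videos: List[str], max_videos_to_sample: int, random_seed: Optional[int]
) -> List[str]:
    """Pick a subset of videos; a fixed seed makes the subset reproducible."""
    if len(videos) <= max_videos_to_sample:
        return list(videos)
    return random.Random(random_seed).sample(videos, max_videos_to_sample)


def cap_frames(frames: List[str], max_total_frames_payload: int) -> List[str]:
    """Evenly thin a pooled frame list down to the payload budget."""
    if len(frames) <= max_total_frames_payload:
        return list(frames)
    step = len(frames) / max_total_frames_payload
    return [frames[int(i * step)] for i in range(max_total_frames_payload)]


videos = [f"clip_{i}.mp4" for i in range(8)]
first = sample_videos(videos, max_videos_to_sample=5, random_seed=42)
second = sample_videos(videos, max_videos_to_sample=5, random_seed=42)

# Five frames per sampled video gives 25 pooled frames, above the budget of 15.
frames = [f"{video}#frame{k}" for video in first for k in range(5)]
capped = cap_frames(frames, max_total_frames_payload=15)
```

With the defaults listed above (5 videos, 5 frames each, 15 frames total), the per-video target alone can exceed the payload budget, which is why a batch-level cap is described separately from `num_frames`.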
