Discover API

Autogenerated reference for the discovery module.

discover

Public discovery helpers for multimodal feature schema generation.

The functions in this module accept raw inputs or folders on disk, delegate the actual reasoning to a provider, and persist the discovered schema as JSON in an output directory. Discovery is intentionally folder-oriented so the same API can be used from notebooks, scripts, and batch pipelines.
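
The persistence step described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: `save_discovery_artifact` and the fallback name `discovered_features.json` are hypothetical, and the real functions each use their own default filename, as listed below.

```python
import json
from pathlib import Path
from typing import Any, Optional, Union


def save_discovery_artifact(
    result: Any,
    output_dir: Union[str, Path] = "outputs",
    output_filename: Optional[str] = None,
    default_filename: str = "discovered_features.json",  # hypothetical fallback name
) -> Path:
    """Write a JSON-serializable result into output_dir and return the file path."""
    target_dir = Path(output_dir)
    # Create the output directory (and any parents) if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / (output_filename or default_filename)
    with target.open("w", encoding="utf-8") as handle:
        json.dump(result, handle, indent=2, ensure_ascii=False)
    return target
```

Because the artifact is written to a directory rather than returned only in memory, the same call works unchanged from a notebook cell, a script, or a batch job.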

Functions:

discover_features_from_images(image_paths_or_folder: str | List[str], prompt: str = image_discovery_prompt, provider: Optional[OpenAIProvider] = None, as_set: bool = True, output_dir: str | Path = 'outputs', output_filename: Optional[str] = None) -> DiscoveryResult

Discover features from image files and persist the provider response.

Parameters:

  • image_paths_or_folder

    (str | List[str]) –

    A single image path, a folder containing images, or a list of image file paths.

  • prompt

    (str, default: image_discovery_prompt ) –

    System-style prompt passed through to the provider.

  • provider

    (Optional[OpenAIProvider], default: None ) –

    Optional provider instance. When omitted, an OpenAIProvider is created from environment variables.

  • as_set

    (bool, default: True ) –

    When True, all images are analyzed together and a single shared feature schema is produced. When False, each image is sent independently and the result contains one entry per image.

  • output_dir

    (str | Path, default: 'outputs' ) –

    Directory where the JSON artifact should be written.

  • output_filename

    (Optional[str], default: None ) –

    Custom filename for the saved artifact. Defaults to discovered_image_features.json.

Returns:

  • DiscoveryResult

    A single discovery payload in joint mode or a list of payloads in per-image mode. The on-disk JSON always preserves the raw provider result list.

Raises:

  • FileNotFoundError

    If the provided path does not exist.

  • ValueError

    If no supported image files are found.

  • RuntimeError

    If image decoding fails for every candidate input.
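
The input handling and the first two documented exceptions can be sketched as below. This is an assumption-laden illustration rather than the library's code: `resolve_image_inputs` is a hypothetical helper, and the extension set is a guess because this reference does not list the supported image formats.

```python
from pathlib import Path
from typing import List, Union

# Assumption: the real set of supported extensions is not listed in this reference.
SUPPORTED_IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def resolve_image_inputs(
    image_paths_or_folder: Union[str, Path, List[str]],
) -> List[Path]:
    """Expand a single path, a folder, or a list of paths into image files."""
    if isinstance(image_paths_or_folder, (str, Path)):
        candidates = [image_paths_or_folder]
    else:
        candidates = list(image_paths_or_folder)

    resolved: List[Path] = []
    for candidate in candidates:
        path = Path(candidate)
        if not path.exists():
            raise FileNotFoundError(f"Input path does not exist: {path}")
        # A folder contributes every supported file directly inside it.
        files = sorted(path.iterdir()) if path.is_dir() else [path]
        resolved.extend(
            f for f in files
            if f.is_file() and f.suffix.lower() in SUPPORTED_IMAGE_EXTENSIONS
        )
    if not resolved:
        raise ValueError("No supported image files were found.")
    return resolved
```

The sketch stops before decoding, so the documented RuntimeError (raised when every candidate fails to decode) is not modeled here.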

discover_features_from_tabular(file_or_folder: str | Path, text_column: str, provider: Optional[OpenAIProvider] = None, prompt: str = text_discovery_prompt, as_set: bool = True, output_dir: str | Path = 'outputs', output_filename: Optional[str] = None, max_rows: Optional[int] = None) -> DiscoveryResult

Discover features from tabular datasets by projecting a text column.

Supported files are loaded into a single DataFrame, the selected text column is extracted, and the resulting list of strings is delegated to discover_features_from_texts.

Parameters:

  • file_or_folder

    (str | Path) –

    A single tabular file or a directory containing supported tabular files.

  • text_column

    (str) –

    Column name whose values should be used as textual input for discovery.

  • provider

    (Optional[OpenAIProvider], default: None ) –

    Optional provider instance.

  • prompt

    (str, default: text_discovery_prompt ) –

    Prompt passed through to the provider.

  • as_set

    (bool, default: True ) –

    Whether to discover one shared schema across all sampled rows or process rows independently.

  • output_dir

    (str | Path, default: 'outputs' ) –

    Directory where the JSON artifact should be written.

  • output_filename

    (Optional[str], default: None ) –

    Custom filename for the saved artifact. Defaults to discovered_tabular_features.json.

  • max_rows

    (Optional[int], default: None ) –

    Optional cap on how many rows are used from the concatenated dataset.

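Returns:

  • DiscoveryResult

    The result produced by discover_features_from_texts for the extracted column values: a single discovery payload in joint mode or a list of payloads in per-text mode.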

Raises:

  • FileNotFoundError

    If the provided path does not exist.

  • ValueError

    If no supported tabular files are found or text_column is missing.
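
The column projection described above can be sketched as below. Several things here are assumptions: `collect_text_column` is a hypothetical helper, only CSV files are treated as supported, the standard-library `csv` module stands in for the DataFrame the library uses, and `max_rows` is applied by keeping the first rows (the reference does not say whether rows are truncated or sampled).

```python
import csv
from pathlib import Path
from typing import List, Optional, Union


def collect_text_column(
    file_or_folder: Union[str, Path],
    text_column: str,
    max_rows: Optional[int] = None,
) -> List[str]:
    """Concatenate one column across CSV files into a list of strings."""
    root = Path(file_or_folder)
    if not root.exists():
        raise FileNotFoundError(f"Input path does not exist: {root}")
    files = sorted(root.iterdir()) if root.is_dir() else [root]
    # Assumption: CSV is the only supported format in this sketch.
    files = [f for f in files if f.is_file() and f.suffix.lower() == ".csv"]
    if not files:
        raise ValueError("No supported tabular files were found.")

    texts: List[str] = []
    for file in files:
        with file.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            if text_column not in (reader.fieldnames or []):
                raise ValueError(f"Column {text_column!r} is missing from {file.name}")
            texts.extend(row[text_column] or "" for row in reader)
    # Apply the cap after concatenation, mirroring the max_rows description.
    return texts if max_rows is None else texts[:max_rows]
```

The returned list is the kind of input that would then be handed to discover_features_from_texts.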

discover_features_from_texts(texts_or_file: str | List[str], prompt: str = text_discovery_prompt, provider: Optional[OpenAIProvider] = None, as_set: bool = True, output_dir: str | Path = 'outputs', output_filename: Optional[str] = None) -> DiscoveryResult

Discover features from text strings, files, or folders of documents.

Parameters:

  • texts_or_file

    (str | List[str]) –

    Either a raw text string, a list of raw text strings, a single supported document path, or a directory containing supported text documents. String inputs are treated as paths only when they already exist on disk or look path-like, such as notes/file.txt.

  • prompt

    (str, default: text_discovery_prompt ) –

    Prompt passed through to the provider.

  • provider

    (Optional[OpenAIProvider], default: None ) –

    Optional provider instance. Defaults to OpenAIProvider.

  • as_set

    (bool, default: True ) –

    When True, all extracted text is combined into a single request so the provider can discover a shared schema. When False, each text chunk is processed independently.

  • output_dir

    (str | Path, default: 'outputs' ) –

    Directory where the JSON artifact should be written.

  • output_filename

    (Optional[str], default: None ) –

    Custom filename for the saved artifact. Defaults to discovered_text_features.json.

Returns:

  • DiscoveryResult

    A single discovery payload in joint mode or a list of payloads in per-text mode.

Raises:

  • FileNotFoundError

    If a path-like input does not exist.

  • ValueError

    If the path is invalid or no supported text input can be extracted.
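
The rule that decides whether a string is a path or raw text is only described informally for texts_or_file, so the sketch below is one plausible reading, not the library's actual heuristic. `looks_like_path` is a hypothetical name, and the "path-like" test (a single whitespace-free token with a separator or a short file extension) is an assumption.

```python
from pathlib import Path


def looks_like_path(value: str) -> bool:
    """Hypothetical heuristic: True if value exists on disk or looks path-like."""
    if not value.strip():
        return False
    try:
        if Path(value).exists():
            return True
    except (OSError, ValueError):
        # Very long strings or embedded NUL bytes cannot be real paths.
        return False
    # Assumption: raw text usually contains whitespace, paths usually do not.
    if any(ch.isspace() for ch in value):
        return False
    has_separator = "/" in value or "\\" in value
    suffix = Path(value).suffix
    has_extension = 1 < len(suffix) <= 5 and suffix[1:].isalnum()
    return has_separator or has_extension
```

Under this reading, `notes/file.txt` is treated as a path even when it is missing, which is consistent with the documented FileNotFoundError for path-like inputs that do not exist, while an ordinary sentence is treated as raw text.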

discover_features_from_videos(videos_or_folder: str | List[str], prompt: str = image_discovery_prompt, provider: Optional[OpenAIProvider] = None, as_set: bool = True, num_frames: int = 5, output_dir: str | Path = 'outputs', output_filename: Optional[str] = None, use_audio: bool = True, max_videos_to_sample: int = 5, max_total_frames_payload: int = 15, random_seed: Optional[int] = None) -> DiscoveryResult

Discover features from one or more videos.

Each video is converted into representative frames and, optionally, an audio transcript. The resulting multimodal payload is sent to the provider and the raw response is written to JSON.

Parameters:

  • videos_or_folder

    (str | List[str]) –

    A single video path, a folder containing videos, or a list of video file paths.

  • prompt

    (str, default: image_discovery_prompt ) –

    Prompt passed through to the provider.

  • provider

    (Optional[OpenAIProvider], default: None ) –

    Optional provider instance implementing image_features and, when use_audio=True, optionally transcribe_audio.

  • as_set

    (bool, default: True ) –

    When True, all extracted frames are analyzed together to produce one shared schema. When False, all extracted frames are pooled together and analyzed individually, so the returned list has one entry per extracted frame rather than one entry per source video.

  • num_frames

    (int, default: 5 ) –

    Target number of key frames to extract per video before downsampling across the batch.

  • output_dir

    (str | Path, default: 'outputs' ) –

    Directory where the JSON artifact should be written.

  • output_filename

    (Optional[str], default: None ) –

    Custom filename for the saved artifact. Defaults to discovered_video_features.json.

  • use_audio

    (bool, default: True ) –

    Whether to extract an audio track and include a transcript as extra context when the provider supports transcription.

  • max_videos_to_sample

    (int, default: 5 ) –

    Upper bound on how many videos are sampled from a folder input to control cost and payload size. When a folder contains more than this many videos, a subset is sampled before frame extraction.

  • max_total_frames_payload

    (int, default: 15 ) –

    Upper bound on the total number of frames sent to the provider across the batch.

  • random_seed

    (Optional[int], default: None ) –

    Optional seed used when folder inputs need to sample a subset of videos. Pass a value here to make the sampled subset reproducible across runs.

Returns:

  • DiscoveryResult

    A single discovery payload in joint mode or a list of payloads in pooled per-frame mode.

Raises:

  • FileNotFoundError

    If the input path is missing or a folder contains no supported video files.

  • ValueError

    If no frames can be extracted from the provided videos.
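
The interaction between max_videos_to_sample, num_frames, max_total_frames_payload, and random_seed can be sketched as below. `plan_frame_budget` is a hypothetical helper, and the evenly spaced downsampling is an assumption: the reference states that the batch is capped but not how frames are chosen.

```python
import random
from typing import List, Optional, Tuple


def plan_frame_budget(
    video_names: List[str],
    num_frames: int = 5,
    max_videos_to_sample: int = 5,
    max_total_frames_payload: int = 15,
    random_seed: Optional[int] = None,
) -> List[Tuple[str, int]]:
    """Return (video, frame_index) pairs that would be sent to the provider."""
    rng = random.Random(random_seed)
    if len(video_names) > max_videos_to_sample:
        # A fixed seed makes the sampled subset reproducible across runs.
        sampled = sorted(rng.sample(video_names, max_videos_to_sample))
    else:
        sampled = list(video_names)

    # Each sampled video first contributes num_frames candidate frames.
    frames = [(name, index) for name in sampled for index in range(num_frames)]
    cap = max_total_frames_payload
    if len(frames) > cap:
        # Assumption: downsample with evenly spaced picks across the whole batch.
        frames = [frames[k * len(frames) // cap] for k in range(cap)]
    return frames
```

With the defaults, five videos at five frames each give 25 candidate frames, which the cap reduces to 15. Passing the same random_seed twice yields the same plan.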