Discover API
Autogenerated reference for the discovery module.
discover
Public discovery helpers for multimodal feature schema generation.
The functions in this module accept raw inputs or folders on disk, delegate the actual reasoning to a provider, and persist the discovered schema as JSON in an output directory. Discovery is intentionally folder-oriented so the same API can be used from notebooks, scripts, and batch pipelines.
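The delegate-then-persist flow described above can be sketched in a few lines. This is an illustration only, not the module's implementation: `run_discovery`, the `analyze` callable standing in for the provider, and the `discovered_features.json` default are hypothetical names chosen for the example.

```python
import json
from pathlib import Path
from typing import Any, Callable, Optional, Sequence, Tuple, Union


def run_discovery(
    inputs: Sequence[str],
    analyze: Callable[[Sequence[str]], Any],
    output_dir: Union[str, Path] = "outputs",
    output_filename: Optional[str] = None,
    default_filename: str = "discovered_features.json",
) -> Tuple[Any, Path]:
    """Hypothetical helper: delegate to `analyze`, then save the result as JSON."""
    result = analyze(inputs)  # stand-in for the provider call
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder on first use
    out_path = out_dir / (output_filename or default_filename)
    out_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result, out_path
```

Because the artifact is written to a folder rather than returned only in memory, a notebook, a script, and a batch job can all read the same file afterwards.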
Functions:
- discover_features_from_images – Discover features from image files and persist the provider response.
- discover_features_from_tabular – Discover features from tabular datasets by projecting a text column.
- discover_features_from_texts – Discover features from text strings, files, or folders of documents.
- discover_features_from_videos – Discover features from one or more videos.
discover_features_from_images(image_paths_or_folder: str | List[str], prompt: str = image_discovery_prompt, provider: Optional[OpenAIProvider] = None, as_set: bool = True, output_dir: str | Path = 'outputs', output_filename: Optional[str] = None) -> DiscoveryResult
Discover features from image files and persist the provider response.
Parameters:
- image_paths_or_folder (str | List[str]) – A single image path, a folder containing images, or a list of image file paths.
- prompt (str, default: image_discovery_prompt) – System-style prompt passed through to the provider.
- provider (Optional[OpenAIProvider], default: None) – Optional provider instance. When omitted, an OpenAIProvider is created from environment variables.
- as_set (bool, default: True) – When True, all images are analyzed together and a single shared feature schema is produced. When False, each image is sent independently and the result contains one entry per image.
- output_dir (str | Path, default: 'outputs') – Directory where the JSON artifact should be written.
- output_filename (Optional[str], default: None) – Custom filename for the saved artifact. Defaults to discovered_image_features.json.
Returns:
- DiscoveryResult – A single discovery payload in joint mode or a list of payloads in per-image mode. The on-disk JSON always preserves the raw provider result list.
Raises:
- FileNotFoundError – If the provided path does not exist.
- ValueError – If no supported image files are found.
- RuntimeError – If image decoding fails for every candidate input.
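The difference between joint mode and per-image mode is easiest to see with a stand-in provider. The sketch below is an assumption-laden illustration of the documented `as_set` semantics: `discover` and `fake_analyze` are hypothetical names, and no real provider or image decoding is involved.

```python
from typing import Any, Callable, Dict, List, Sequence, Union

Payload = Dict[str, Any]


def discover(
    items: Sequence[str],
    analyze: Callable[[List[str]], Payload],
    as_set: bool = True,
) -> Union[Payload, List[Payload]]:
    """Hypothetical dispatcher mirroring the documented as_set behavior."""
    if not items:
        raise ValueError("no supported inputs found")
    if as_set:
        # Joint mode: one request covering every input, one shared payload.
        return analyze(list(items))
    # Per-item mode: one request per input, one payload per input.
    return [analyze([item]) for item in items]


def fake_analyze(batch: List[str]) -> Payload:
    """Stand-in for a provider call; it only reports what it was given."""
    return {"inputs": batch, "n_inputs": len(batch)}
```

With two inputs, `discover(paths, fake_analyze)` returns one dictionary covering both, while `discover(paths, fake_analyze, as_set=False)` returns a two-element list.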
discover_features_from_tabular(file_or_folder: str | Path, text_column: str, provider: Optional[OpenAIProvider] = None, prompt: str = text_discovery_prompt, as_set: bool = True, output_dir: str | Path = 'outputs', output_filename: Optional[str] = None, max_rows: Optional[int] = None) -> DiscoveryResult
Discover features from tabular datasets by projecting a text column.
Supported files are loaded into a single DataFrame, the selected text column is extracted, and the resulting list of strings is delegated to discover_features_from_texts.
Parameters:
- file_or_folder (str | Path) – A single tabular file or a directory containing supported tabular files.
- text_column (str) – Column name whose values should be used as textual input for discovery.
- provider (Optional[OpenAIProvider], default: None) – Optional provider instance.
- prompt (str, default: text_discovery_prompt) – Prompt passed through to the provider.
- as_set (bool, default: True) – Whether to discover one shared schema across all sampled rows or process rows independently.
- output_dir (str | Path, default: 'outputs') – Directory where the JSON artifact should be written.
- output_filename (Optional[str], default: None) – Custom filename for the saved artifact. Defaults to discovered_tabular_features.json.
- max_rows (Optional[int], default: None) – Optional cap on how many rows are used from the concatenated dataset.
Returns:
- DiscoveryResult – The same return shape as the text-discovery call this function delegates to.
Raises:
- FileNotFoundError – If the provided path does not exist.
- ValueError – If no supported tabular files are found or text_column is missing.
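The projection step (load the files, pull out one column, cap the rows) might look roughly like the sketch below. It makes several assumptions that the reference does not confirm: it uses the standard-library `csv` module instead of a DataFrame to stay dependency-free, it treats only `.csv` files as supported, and it applies `max_rows` by keeping the first rows. `project_text_column` is a hypothetical name.

```python
import csv
from pathlib import Path
from typing import List, Optional, Union


def project_text_column(
    file_or_folder: Union[str, Path],
    text_column: str,
    max_rows: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: collect one column's values across CSV files."""
    path = Path(file_or_folder)
    if not path.exists():
        raise FileNotFoundError(f"path does not exist: {path}")
    # Assumption: a folder contributes its *.csv files in sorted order.
    files = sorted(path.glob("*.csv")) if path.is_dir() else [path]
    if not files:
        raise ValueError(f"no supported tabular files in {path}")
    texts: List[str] = []
    for csv_file in files:
        with csv_file.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            if reader.fieldnames is None or text_column not in reader.fieldnames:
                raise ValueError(f"column {text_column!r} missing in {csv_file.name}")
            texts.extend(row[text_column] for row in reader)
    # Assumption: the cap keeps the first max_rows rows of the concatenation.
    return texts if max_rows is None else texts[:max_rows]
```

The resulting list of strings is the kind of input that would then be handed to the text discovery function.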
discover_features_from_texts(texts_or_file: str | List[str], prompt: str = text_discovery_prompt, provider: Optional[OpenAIProvider] = None, as_set: bool = True, output_dir: str | Path = 'outputs', output_filename: Optional[str] = None) -> DiscoveryResult
Discover features from text strings, files, or folders of documents.
Parameters:
- texts_or_file (str | List[str]) – Either a raw text string, a list of raw text strings, a single supported document path, or a directory containing supported text documents. String inputs are treated as paths only when they already exist on disk or look path-like, such as notes/file.txt.
- prompt (str, default: text_discovery_prompt) – Prompt passed through to the provider.
- provider (Optional[OpenAIProvider], default: None) – Optional provider instance. Defaults to OpenAIProvider.
- as_set (bool, default: True) – When True, all extracted text is combined into a single request so the provider can discover a shared schema. When False, each text chunk is processed independently.
- output_dir (str | Path, default: 'outputs') – Directory where the JSON artifact should be written.
- output_filename (Optional[str], default: None) – Custom filename for the saved artifact. Defaults to discovered_text_features.json.
Returns:
- DiscoveryResult – A single discovery payload in joint mode or a list of payloads in per-text mode.
Raises:
- FileNotFoundError – If a path-like input does not exist.
- ValueError – If the path is invalid or no supported text input can be extracted.
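Since a string argument can be either raw text or a path, some heuristic has to tell the two apart. The reference only says that strings count as paths when they exist on disk or "look path-like", so the rule below (a separator, a file suffix, and no spaces) is a guess made for illustration. `looks_like_path` and `read_text_input` are hypothetical names, and the sketch handles single files only.

```python
from pathlib import Path


def looks_like_path(value: str) -> bool:
    """Hypothetical heuristic: True for existing paths or path-shaped strings."""
    text = value.strip()
    if not text or "\n" in text:
        return False  # empty or multi-line strings are treated as raw text
    try:
        if Path(text).exists():
            return True
    except OSError:
        return False  # e.g. a string too long to be a valid file name
    has_separator = "/" in text or "\\" in text
    has_suffix = Path(text).suffix != ""
    return has_separator and has_suffix and " " not in text


def read_text_input(value: str) -> str:
    """Return file contents for path-like strings, the string itself otherwise."""
    if not looks_like_path(value):
        return value
    path = Path(value.strip())
    if not path.is_file():
        # Mirrors the documented error for a path-like input that is missing.
        raise FileNotFoundError(f"text input not found: {path}")
    return path.read_text(encoding="utf-8")
```

Under this rule `notes/file.txt` is classified as a path even when it does not exist, which is why a missing file surfaces as `FileNotFoundError` instead of being analyzed as a twelve-character document.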
discover_features_from_videos(videos_or_folder: str | List[str], prompt: str = image_discovery_prompt, provider: Optional[OpenAIProvider] = None, as_set: bool = True, num_frames: int = 5, output_dir: str | Path = 'outputs', output_filename: Optional[str] = None, use_audio: bool = True, max_videos_to_sample: int = 5, max_total_frames_payload: int = 15, random_seed: Optional[int] = None) -> DiscoveryResult
Discover features from one or more videos.
Each video is converted into representative frames and, optionally, an audio transcript. The resulting multimodal payload is sent to the provider and the raw response is written to JSON.
Parameters:
- videos_or_folder (str | List[str]) – A single video path, a folder containing videos, or a list of video file paths.
- prompt (str, default: image_discovery_prompt) – Prompt passed through to the provider.
- provider (Optional[OpenAIProvider], default: None) – Optional provider instance implementing image_features and, when use_audio=True, optionally transcribe_audio.
- as_set (bool, default: True) – When True, all extracted frames are analyzed together to produce one shared schema. When False, all extracted frames are pooled together and analyzed individually, so the returned list has one entry per extracted frame rather than one entry per source video.
- num_frames (int, default: 5) – Target number of key frames to extract per video before downsampling across the batch.
- output_dir (str | Path, default: 'outputs') – Directory where the JSON artifact should be written.
- output_filename (Optional[str], default: None) – Custom filename for the saved artifact. Defaults to discovered_video_features.json.
- use_audio (bool, default: True) – Whether to extract an audio track and include a transcript as extra context when the provider supports transcription.
- max_videos_to_sample (int, default: 5) – Upper bound on how many videos are sampled from a folder input to control cost and payload size. When a folder contains more than this many videos, a subset is sampled before frame extraction.
- max_total_frames_payload (int, default: 15) – Upper bound on the total number of frames sent to the provider across the batch.
- random_seed (Optional[int], default: None) – Optional seed used when folder inputs need to sample a subset of videos. Pass a value here to make the sampled subset reproducible across runs.
Returns:
- DiscoveryResult – A single discovery payload in joint mode or a list of payloads in pooled per-frame mode.
Raises:
- FileNotFoundError – If the input path is missing or a folder contains no supported video files.
- ValueError – If no frames can be extracted from the provided videos.
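The three budget parameters (`max_videos_to_sample`, `max_total_frames_payload`, `random_seed`) interact, and a small sketch may make that interaction easier to follow. The code below is an assumption: the reference does not say how videos are sampled or how frames are downsampled, so seeded random sampling and an even stride are used here for illustration. `sample_videos` and `cap_frames` are hypothetical names, and frames are represented by plain list items.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_videos(
    videos: Sequence[str],
    max_videos_to_sample: int = 5,
    random_seed: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: keep at most max_videos_to_sample videos."""
    ordered = sorted(videos)  # fixed order so a seed gives a stable subset
    if len(ordered) <= max_videos_to_sample:
        return ordered
    rng = random.Random(random_seed)  # local generator, global state untouched
    return sorted(rng.sample(ordered, max_videos_to_sample))


def cap_frames(frames: Sequence[T], max_total_frames_payload: int = 15) -> List[T]:
    """Hypothetical helper: evenly thin the pooled frames down to the cap."""
    total = len(frames)
    if total <= max_total_frames_payload:
        return list(frames)
    # Integer stride: pick max_total_frames_payload indices spread over the pool.
    return [
        frames[(i * total) // max_total_frames_payload]
        for i in range(max_total_frames_payload)
    ]
```

With the documented defaults, five sampled videos at five frames each yield 25 candidate frames, which is above the default cap of 15, so some thinning across the batch would be expected before the payload is sent.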