Outputs and Schema Reference
The library writes two main artifact types:
- Discovery JSON in outputs/
- Generation CSV files in outputs/
Discovery JSON
The discovery helpers always write the raw provider result list to disk, even when the Python return value is simplified to a single dictionary in joint mode.
Typical path names:
- outputs/discovered_image_features.json
- outputs/discovered_text_features.json
- outputs/discovered_tabular_features.json
- outputs/discovered_video_features.json
Typical joint-discovery structure:
[
  {
    "proposed_features": [
      {
        "feature": "spice level",
        "type": "categorical",
        "description": "How spicy the dish appears or is described to be"
      },
      {
        "feature": "presentation style",
        "type": "categorical"
      }
    ]
  }
]
Notes:
- The package expects a proposed_features collection when loading a schema for generation.
- Each feature entry is provider-defined. Common keys are feature, name, type, and description.
- Per-item discovery writes one list entry per input item instead of a single shared schema.
- For video discovery with as_set=False, the package pools frames across all input videos and writes one result per extracted frame, not one result per video.
- Folder-based video discovery samples at most max_videos_to_sample videos before extraction; pass random_seed to make that subset reproducible.
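The notes above can be illustrated with a minimal sketch that collects feature labels from a discovery result list. It uses only the standard library. The helper name feature_names is hypothetical and not part of the package, and falling back from the feature key to the name key is an assumption based on the common keys listed above.

```python
import json


def feature_names(results):
    """Collect feature labels from a discovery result list (hypothetical helper)."""
    names = []
    for result in results:
        for entry in result.get("proposed_features", []):
            # Assumption: a provider labels each entry with "feature" or "name".
            label = entry.get("feature") or entry.get("name")
            if label is not None:
                names.append(label)
    return names


# Inline sample mirroring the joint-discovery structure shown above.
sample = json.loads("""
[{"proposed_features": [
    {"feature": "spice level", "type": "categorical"},
    {"feature": "presentation style", "type": "categorical"}
]}]
""")
print(feature_names(sample))
```

To inspect a real run, load the file with json.load instead of the inline sample. Because per-item discovery writes one list entry per input item, the same loop then returns labels pooled across all items.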
Generation CSV
Generation creates one CSV per class folder, named <class_name>_feature_values.csv.
Column layout:
| Column | Meaning |
|---|---|
| File | Source file name, or filename__row_<n> for tabular row-level outputs |
| Class | Class folder name, or row-level label override when label_column is provided |
| <feature columns> | One column per discovered feature |
| raw_llm_output | Raw JSON payload returned by the provider for traceability |
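When post-processing tabular outputs, it can help to split a File value back into the source file name and the row number. The sketch below assumes that <n> is a non-negative integer and that the suffix always has the form __row_<n>. The function split_file_value is a hypothetical helper, not something the package provides.

```python
import re

# Assumption: tabular row-level values end in "__row_" followed by digits.
_ROW_SUFFIX = re.compile(r"^(?P<name>.+)__row_(?P<row>\d+)$")


def split_file_value(value):
    """Return (file_name, row_index); row_index is None for non-tabular outputs."""
    match = _ROW_SUFFIX.match(value)
    if match is None:
        return value, None
    return match.group("name"), int(match.group("row"))
```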
Example:
File,Class,spice level,presentation style,raw_llm_output
review1.txt,positive,high,refined,"{""features"": {""spice level"": ""high"", ""presentation style"": ""refined""}}"
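The raw_llm_output column is a JSON string embedded in a quoted CSV field, so it needs a second parsing step after the CSV reader has unescaped the doubled quotes. A minimal sketch using only the standard library follows. The helper read_rows is hypothetical, and the inline text stands in for a real per-class CSV file.

```python
import csv
import io
import json

# Inline copy of the example row; a real run would open the per-class CSV file.
csv_text = (
    'File,Class,spice level,presentation style,raw_llm_output\n'
    'review1.txt,positive,high,refined,'
    '"{""features"": {""spice level"": ""high"", ""presentation style"": ""refined""}}"\n'
)


def read_rows(handle):
    """Yield (row, payload) pairs; payload is None when the raw output is not valid JSON."""
    for row in csv.DictReader(handle):
        try:
            payload = json.loads(row["raw_llm_output"])
        except (json.JSONDecodeError, TypeError):
            payload = None
        yield row, payload


rows = list(read_rows(io.StringIO(csv_text)))
row, payload = rows[0]
print(row["spice level"], payload["features"]["presentation style"])
```

Treating an unparsable payload as None rather than raising is a design choice of this sketch: the feature columns remain usable even if a provider returns malformed JSON for a row.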
If merge_to_single_csv=True, the package also writes outputs/all_feature_values.csv unless you override merged_csv_name.
Schema loading rules
load_discovered_features normalizes these cases into one dictionary shape:
- a dictionary that already contains proposed_features
- a single-item list containing that dictionary
- a list of feature entries without the outer dictionary, which is wrapped automatically
This means generation code can rely on a single in-memory schema form even when provider outputs vary slightly.
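The three cases can be sketched as a small function. This illustrates the rules as described here and is not the package's actual load_discovered_features implementation. The name normalize_schema is hypothetical, and raising ValueError for any other shape (for example, a multi-item per-item result list) is an assumption of the sketch.

```python
def normalize_schema(loaded):
    """Illustrative normalization into {"proposed_features": [...]} (hypothetical)."""
    # Case 1: a dictionary that already contains proposed_features.
    if isinstance(loaded, dict) and "proposed_features" in loaded:
        return loaded
    if isinstance(loaded, list):
        # Case 2: a single-item list containing that dictionary.
        if (
            len(loaded) == 1
            and isinstance(loaded[0], dict)
            and "proposed_features" in loaded[0]
        ):
            return loaded[0]
        # Case 3: a bare list of feature entries, wrapped in the outer dictionary.
        if all(isinstance(e, dict) and "proposed_features" not in e for e in loaded):
            return {"proposed_features": loaded}
    # Assumption: shapes not covered by the three rules are rejected.
    raise ValueError("unrecognized schema shape")
```

For example, normalize_schema([{"feature": "spice level"}]) returns {"proposed_features": [{"feature": "spice level"}]}, the same shape produced by the other two cases.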