Outputs and Schema Reference

The library writes two main artifact types:

  • Discovery JSON in outputs/
  • Generation CSV files in outputs/

Discovery JSON

The discovery helpers always write the raw provider result list to disk, even when the Python return value is simplified to a single dictionary in joint mode.

Typical path names:

  • outputs/discovered_image_features.json
  • outputs/discovered_text_features.json
  • outputs/discovered_tabular_features.json
  • outputs/discovered_video_features.json

Typical joint-discovery structure:

[
  {
    "proposed_features": [
      {
        "feature": "spice level",
        "type": "categorical",
        "description": "How spicy the dish appears or is described to be"
      },
      {
        "feature": "presentation style",
        "type": "categorical"
      }
    ]
  }
]

Notes:

  • The package expects a proposed_features collection when loading a schema for generation.
  • Each feature entry is provider-defined. Common keys are feature, name, type, and description.
  • Per-item discovery writes one list entry per input item instead of a single shared schema.
  • For video discovery with as_set=False, the package pools frames across all input videos and writes one result per extracted frame, not one result per video.
  • Folder-based video discovery samples at most max_videos_to_sample videos before extraction; pass random_seed to make that subset reproducible.
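
As a rough illustration of the structure and notes above, the sketch below collects feature names from a raw discovery result list. The helper name feature_names is hypothetical and not part of the package. It assumes each list entry is a dictionary holding a proposed_features list, and that a feature is named under feature or, failing that, name. The sample is inlined so the snippet runs on its own; in practice the list would come from a file such as outputs/discovered_text_features.json.

```python
import json


def feature_names(results):
    """Collect unique feature names from a raw discovery result list.

    Hypothetical helper: assumes each entry is a dict with a
    "proposed_features" list, and that each feature is named under
    "feature" or, failing that, "name".
    """
    names = []
    for entry in results:
        for feat in entry.get("proposed_features", []):
            label = feat.get("feature") or feat.get("name")
            if label and label not in names:
                names.append(label)
    return names


# Inline sample mirroring the joint-discovery structure shown above.
sample = json.loads("""
[
  {
    "proposed_features": [
      {"feature": "spice level", "type": "categorical",
       "description": "How spicy the dish appears or is described to be"},
      {"feature": "presentation style", "type": "categorical"}
    ]
  }
]
""")

print(feature_names(sample))
```

Because per-item discovery writes one list entry per input item, the same loop also works there and simply de-duplicates names that recur across entries.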

Generation CSV

Generation creates one CSV per class folder, named <class_name>_feature_values.csv.

Column layout:

Column             Meaning
File               Source file name, or filename__row_<n> for tabular row-level outputs
Class              Class folder name, or row-level label override when label_column is provided
<feature columns>  One column per discovered feature
raw_llm_output     Raw JSON payload returned by the provider for traceability

Example:

File,Class,spice level,presentation style,raw_llm_output
review1.txt,positive,high,refined,"{""features"": {""spice level"": ""high"", ""presentation style"": ""refined""}}"

If merge_to_single_csv=True, the package also writes outputs/all_feature_values.csv unless you override merged_csv_name.
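
A generation CSV can be read back with the standard library alone. The sketch below is one possible way to do it, not the package's own reader: read_feature_rows is a hypothetical helper, and it assumes that every column other than File, Class, and raw_llm_output is a feature column and that raw_llm_output holds a JSON string. The sample row is the one from the example above.

```python
import csv
import io
import json

# Columns with a fixed meaning; everything else is treated as a feature column.
RESERVED = {"File", "Class", "raw_llm_output"}

# Same content as the example row above, inlined so the snippet is runnable.
sample_csv = (
    "File,Class,spice level,presentation style,raw_llm_output\n"
    "review1.txt,positive,high,refined,"
    '"{""features"": {""spice level"": ""high"", ""presentation style"": ""refined""}}"\n'
)


def read_feature_rows(handle):
    """Parse a generation CSV into a list of dicts (hypothetical helper)."""
    rows = []
    for row in csv.DictReader(handle):
        features = {k: v for k, v in row.items() if k not in RESERVED}
        try:
            raw = json.loads(row.get("raw_llm_output"))
        except (json.JSONDecodeError, TypeError):
            raw = None  # keep the row even if the payload is missing or malformed
        rows.append({
            "file": row["File"],
            "label": row["Class"],
            "features": features,
            "raw": raw,
        })
    return rows


rows = read_feature_rows(io.StringIO(sample_csv))
print(rows[0]["features"])
```

To read a real file, the same function could be called with open(path, newline="", encoding="utf-8") in place of the io.StringIO wrapper.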

Schema loading rules

load_discovered_features normalizes these cases into one dictionary shape:

  • a dictionary that already contains proposed_features
  • a single-item list containing that dictionary
  • a list of feature entries without the outer dictionary, which is wrapped automatically

This means generation code can rely on a single in-memory schema form even when provider outputs vary slightly.
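
The three rules above can be sketched as follows. This is an illustration of the described behavior under stated assumptions, not the source of load_discovered_features: the function name normalize_schema is hypothetical, and the handling of a multi-entry list whose entries each carry proposed_features (as per-item discovery produces) is not specified here, so the sketch rejects that case rather than guessing.

```python
def normalize_schema(data):
    """Return a dict with a "proposed_features" key (hypothetical sketch)."""
    # Case 1: already a dictionary containing proposed_features.
    if isinstance(data, dict) and "proposed_features" in data:
        return data

    if isinstance(data, list):
        has_wrapper = [
            isinstance(e, dict) and "proposed_features" in e for e in data
        ]
        # Case 2: a single-item list containing that dictionary.
        if len(data) == 1 and has_wrapper[0]:
            return data[0]
        # Case 3: a bare list of feature entries, wrapped automatically.
        if not any(has_wrapper):
            return {"proposed_features": data}

    # Anything else is outside the three documented shapes.
    raise ValueError("unrecognized discovery schema shape")


features = [
    {"feature": "spice level", "type": "categorical"},
    {"feature": "presentation style", "type": "categorical"},
]
expected = {"proposed_features": features}

# All three input shapes normalize to the same dictionary.
print(normalize_schema(expected) == expected)
print(normalize_schema([expected]) == expected)
print(normalize_schema(features) == expected)
```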