aevyra-verdict accepts JSONL and CSV files. For JSONL, the format is auto-detected from the
first record; pass format= explicitly to override. CSV files are loaded with from_csv(),
which takes configurable column names.
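As a rough illustration of first-record detection, the sketch below guesses the format from the top-level keys of a record. The helper names detect_format and detect_jsonl_format are hypothetical, the key checks are an assumption based on the example records on this page, and the "openai" label for the native format is assumed as well; the library's actual detection logic may differ.

```python
import json


def detect_format(record: dict) -> str:
    """Guess the dataset format from one record's top-level keys (hypothetical helper)."""
    if "messages" in record:
        return "openai"      # assumed name for the native format
    if "conversations" in record:
        return "sharegpt"
    if "instruction" in record:
        return "alpaca"
    raise ValueError(f"Unrecognized record keys: {sorted(record)}")


def detect_jsonl_format(text: str) -> str:
    """Detect the format from the first non-empty line of JSONL text."""
    for line in text.splitlines():
        if line.strip():
            return detect_format(json.loads(line))
    raise ValueError("No records found")


sample = '{"conversations": [{"from": "human", "value": "Hi"}]}\n'
print(detect_jsonl_format(sample))  # sharegpt
```

Because only the first record is inspected in this sketch, a file that mixes formats would be misread, which is one reason to pass format= explicitly.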
Supported formats: OpenAI, ShareGPT, Alpaca, and CSV.
OpenAI
The native format. Used by default.
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "ideal": "The capital of France is Paris.",
  "metadata": {"category": "factual", "difficulty": "easy"}
}
| Field | Required | Description |
|---|---|---|
| messages | Yes | Array of {role, content} objects. Roles: system, user, assistant. |
| ideal | No | Reference answer. Required for ROUGE, BLEU, and exact match metrics. |
| metadata | No | Arbitrary key-value pairs for filtering and grouping. |
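When generating native-format files programmatically, it can help to check records against the fields listed above before writing them. The following is a minimal sketch using only the standard library; validate_record and to_jsonl are hypothetical helpers, and the strictness shown (for example, rejecting an empty messages list) is an assumption rather than documented library behavior.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record: dict) -> None:
    """Raise ValueError if a record does not match the documented fields."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("'messages' is required and must be a non-empty list")
    for i, message in enumerate(messages):
        if not isinstance(message, dict) or message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"messages[{i}] has a missing or unsupported role")
        if not isinstance(message.get("content"), str):
            raise ValueError(f"messages[{i}] needs a string 'content'")
    if "metadata" in record and not isinstance(record["metadata"], dict):
        raise ValueError("'metadata' must be an object of key-value pairs")


def to_jsonl(records: list[dict]) -> str:
    """Serialize validated records as JSONL: one JSON object per line."""
    lines = []
    for record in records:
        validate_record(record)
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


records = [
    {
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "ideal": "The capital of France is Paris.",
        "metadata": {"category": "factual", "difficulty": "easy"},
    }
]
jsonl_text = to_jsonl(records)
```

The resulting text can be written to a .jsonl file and loaded with Dataset.from_jsonl().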
ShareGPT
Common format for open source fine-tuning datasets on HuggingFace.
{
  "conversations": [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": "What is the capital of France?"},
    {"from": "gpt", "value": "The capital of France is Paris."}
  ]
}
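To make the mapping onto the native format concrete, here is a minimal sketch of how a ShareGPT record could be normalized: roles are renamed, the last assistant turn becomes ideal, and fields outside conversations are kept as metadata. The function sharegpt_to_native is hypothetical and is not the library's implementation; passing the system role through unchanged is an assumption based on the example record.

```python
# Role aliases as documented on this page; "system" passing through is an assumption.
ROLE_ALIASES = {
    "system": "system",
    "human": "user",
    "user": "user",
    "gpt": "assistant",
    "assistant": "assistant",
    "chatgpt": "assistant",
    "bard": "assistant",
    "bing": "assistant",
}


def sharegpt_to_native(record: dict) -> dict:
    """Convert one ShareGPT record into the native messages/ideal/metadata shape."""
    messages = [
        {"role": ROLE_ALIASES[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
    ]

    # Pull out the last assistant turn as the reference answer, if there is one.
    ideal = None
    for i in range(len(messages) - 1, -1, -1):
        if messages[i]["role"] == "assistant":
            ideal = messages.pop(i)["content"]
            break

    native = {"messages": messages}
    if ideal is not None:
        native["ideal"] = ideal

    # Everything outside "conversations" is carried along as metadata.
    metadata = {k: v for k, v in record.items() if k != "conversations"}
    if metadata:
        native["metadata"] = metadata
    return native


example = {
    "conversations": [
        {"from": "system", "value": "You are a helpful assistant."},
        {"from": "human", "value": "What is the capital of France?"},
        {"from": "gpt", "value": "The capital of France is Paris."},
    ],
    "category": "factual",
}
converted = sharegpt_to_native(example)
```

In this sketch an unknown role raises a KeyError, which surfaces unexpected data early instead of silently mislabeling a turn.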
The last assistant turn is automatically extracted as the ideal reference answer
and excluded from the prompt messages sent to the model. Any extra fields outside
conversations are preserved as metadata.
Supported role aliases: human / user → user, gpt / assistant / chatgpt / bard / bing → assistant.
Alpaca
Standard instruction-following format used by Alpaca, WizardLM, and similar datasets.
{
  "instruction": "Translate to French.",
  "input": "Hello, how are you?",
  "output": "Bonjour, comment allez-vous?"
}
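As an illustration of how such a record can be turned into chat messages, the sketch below builds a single user message from instruction and input and uses output as ideal. It uses the prompt wording published with the original Stanford Alpaca project; whether aevyra-verdict uses exactly this wording is an assumption, and alpaca_to_native is a hypothetical helper.

```python
# Prompt wording from the Stanford Alpaca project (assumed, not confirmed for this library).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def alpaca_to_native(record: dict) -> dict:
    """Convert one Alpaca record into the native messages/ideal shape."""
    input_text = record.get("input", "")
    if input_text:
        content = PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=input_text
        )
    else:
        # Instruction-only sample: use the shorter template without an Input section.
        content = PROMPT_NO_INPUT.format(instruction=record["instruction"])

    messages = []
    if record.get("system"):
        messages.append({"role": "system", "content": record["system"]})
    messages.append({"role": "user", "content": content})

    native = {"messages": messages}
    if "output" in record:
        native["ideal"] = record["output"]
    return native


converted = alpaca_to_native({
    "instruction": "Translate to French.",
    "input": "Hello, how are you?",
    "output": "Bonjour, comment allez-vous?",
})
```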
instruction and input are combined into a single user message using the standard
Alpaca template. output becomes the ideal reference answer. A system field is
extracted as a system message if present. input can be omitted for instruction-only samples.
CSV
Simplest format for tabular data. Column names default to input and ideal:
input,ideal
"What is the capital of France?","Paris"
"Explain binary search in one sentence.","Binary search repeatedly halves a sorted array..."
Use from_csv() with optional column overrides:
from aevyra_verdict import Dataset
# Default column names (input, ideal)
dataset = Dataset.from_csv("data.csv")
# Custom column names
dataset = Dataset.from_csv("data.csv", input_field="article", output_field="summary")
# Label-free (no reference answers)
dataset = Dataset.from_csv("data.csv", output_field=None)
Missing columns raise a clear error listing the available column names.
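The column handling described above can be sketched with the standard csv module. This is an illustration, not the library's code: rows_from_csv is a hypothetical helper, the error wording is invented, and mapping each input cell to a single user message is an assumption.

```python
import csv
import io


def rows_from_csv(text: str, input_field: str = "input", output_field="ideal") -> list:
    """Parse CSV text into native-style records, checking the column names first."""
    reader = csv.DictReader(io.StringIO(text))
    columns = list(reader.fieldnames or [])

    # output_field=None means label-free: only the input column is required.
    wanted = [input_field] + ([output_field] if output_field is not None else [])
    missing = [name for name in wanted if name not in columns]
    if missing:
        raise ValueError(f"Missing column(s) {missing}; available columns: {columns}")

    records = []
    for row in reader:
        record = {"messages": [{"role": "user", "content": row[input_field]}]}
        if output_field is not None:
            record["ideal"] = row[output_field]
        records.append(record)
    return records


csv_text = 'input,ideal\n"What is the capital of France?","Paris"\n'
records = rows_from_csv(csv_text)
label_free = rows_from_csv(csv_text, output_field=None)
```

Checking the header before iterating means a misspelled column name fails immediately with the list of available columns, not partway through the file.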
Loading
from aevyra_verdict import Dataset
# JSONL — auto-detect format (default)
dataset = Dataset.from_jsonl("data.jsonl")
# JSONL — explicit format
dataset = Dataset.from_jsonl("sharegpt_data.jsonl", format="sharegpt")
dataset = Dataset.from_jsonl("alpaca_data.jsonl", format="alpaca")
# CSV — default column names (input, ideal)
dataset = Dataset.from_csv("data.csv")
# CSV — custom column names
dataset = Dataset.from_csv("data.csv", input_field="article", output_field="summary")
# CSV — label-free (no reference answers)
dataset = Dataset.from_csv("data.csv", output_field=None)
print(dataset.summary())
# {'name': 'data', 'num_conversations': 50, 'has_ideals': True, 'metadata_keys': ['category']}
Or inline, without a file:
dataset = Dataset.from_list([
{"messages": [{"role": "user", "content": "Hello"}], "ideal": "Hi there"},
])
# ShareGPT inline
dataset = Dataset.from_list(sharegpt_records, format="sharegpt")
Filtering
Filter by any metadata field. Multiple filters are ANDed together.
hard = dataset.filter(difficulty="hard")
reasoning = dataset.filter(category="reasoning", difficulty="hard")
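The AND semantics can be illustrated on plain dictionaries: a record is kept only if every keyword argument matches the corresponding metadata value. The helper filter_records below is a hypothetical stand-in for dataset.filter(), shown only to clarify the behavior; treating a missing metadata key as a non-match is an assumption.

```python
def filter_records(records: list, **criteria) -> list:
    """Keep records whose metadata matches every key=value pair in criteria."""
    return [
        record
        for record in records
        if all(
            record.get("metadata", {}).get(key) == value
            for key, value in criteria.items()
        )
    ]


records = [
    {"id": 1, "metadata": {"category": "reasoning", "difficulty": "hard"}},
    {"id": 2, "metadata": {"category": "factual", "difficulty": "hard"}},
    {"id": 3, "metadata": {"category": "reasoning", "difficulty": "easy"}},
    {"id": 4},  # no metadata at all: never matches a non-empty filter
]

hard = filter_records(records, difficulty="hard")
hard_reasoning = filter_records(records, category="reasoning", difficulty="hard")
print([r["id"] for r in hard])            # [1, 2]
print([r["id"] for r in hard_reasoning])  # [1]
```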
CLI
Preview a dataset without running any models:
aevyra-verdict inspect examples/sample_data.jsonl