Experiments¶

UCEF is evaluated through a combination of simulated experiments (for controlled ablation studies) and real benchmark experiments (for end-to-end validation).

Experimental Setup¶

Models¶

Model	Context Window	API Provider
GLM-4-flash	128K	Zhipu AI
DeepSeek-v3	128K	DeepSeek

Benchmarks¶

We evaluate on 8 tasks from the LongBench benchmark suite:

Task	Type	Description
2wikimqa_e	Multi-hop QA	Wikipedia-based multi-hop reasoning
hotpotqa_e	Multi-hop QA	HotpotQA English subset
musique	Multi-step QA	Multi-step reasoning questions
gov_report_e	Summarization	Government report summarization
narrativeqa	Document QA	Narrative comprehension questions
qasper_e	Academic QA	Research paper question answering
passage_retrieval_en_e	Retrieval	Passage retrieval evaluation
multifieldqa_en_e	Multi-field QA	Cross-domain question answering

Baselines¶

Method	Description
Truncate	Simple truncation to context budget (first N tokens)
RAG top-k	TF-IDF chunk scoring + top-k selection
UCEF	Full pipeline: hyperbolic scoring + quantum selection + adaptive compression
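The two non-UCEF baselines can be sketched in a few lines. This is a minimal illustration, not the evaluated implementation: the function names (`truncate`, `rag_top_k`), the whitespace and regex tokenization, the fixed-size chunking, and the smoothed IDF formula are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: a simple word tokenizer stands in for the model tokenizer.
    return re.findall(r"\w+", text.lower())


def truncate(document, budget):
    """Truncate baseline: keep only the first `budget` whitespace tokens."""
    return " ".join(document.split()[:budget])


def chunk(document, chunk_size):
    """Split a document into consecutive chunks of `chunk_size` words."""
    words = document.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def rag_top_k(document, query, k, chunk_size=50):
    """RAG top-k baseline: rank chunks by TF-IDF weight of the query terms."""
    chunks = chunk(document, chunk_size)
    tokenized = [tokenize(c) for c in chunks]
    n = len(chunks)
    # Document frequency: in how many chunks each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumed variant; the source does not specify one).
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    query_terms = set(tokenize(query))

    def score(toks):
        if not toks:
            return 0.0
        tf = Counter(toks)
        return sum(tf[t] / len(toks) * idf[t] for t in query_terms if t in tf)

    top = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)[:k]
    # Return the selected chunks in original document order.
    return [chunks[i] for i in sorted(top)]
```

Returning the selected chunks in document order is a design choice of this sketch; a pipeline could equally pass them to the model in score order.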

Metrics¶

ROUGE-L: F1 score based on longest common subsequence
Token Overlap F1: F1 over the tokens shared by prediction and reference, used as a rough lexical proxy for semantic similarity
Latency: End-to-end pipeline time per query
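For reference, the three metrics can be approximated as follows. This is a sketch under stated assumptions: tokens are obtained by whitespace splitting with no stemming or normalization, so scores may differ from those of a standard ROUGE package, and the helper names (`rouge_l_f1`, `token_overlap_f1`, `timed`) are illustrative rather than taken from the UCEF code.

```python
import time
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def f1(overlap, pred_len, ref_len):
    """Harmonic mean of precision (overlap/pred) and recall (overlap/ref)."""
    if overlap == 0 or pred_len == 0 or ref_len == 0:
        return 0.0
    precision = overlap / pred_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_l_f1(prediction, reference):
    # Order-sensitive: only tokens appearing in the same relative order count.
    pred, ref = prediction.split(), reference.split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))


def token_overlap_f1(prediction, reference):
    # Order-insensitive: multiset intersection of the two token bags.
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    return f1(overlap, len(pred), len(ref))


def timed(fn, *args, **kwargs):
    """Run `fn` and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

The two F1 scores diverge when word order changes: for the prediction `b a` against the reference `a b`, token overlap gives 1.0 while ROUGE-L gives 0.5.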

Sample Size¶

30 samples per task (8 tasks × 30 = 240 samples per method per model)
1,440 total LLM API calls across all experiments (240 samples × 3 methods × 2 models)
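The experiment grid implied by these counts can be enumerated directly. The sketch below assumes exactly one LLM API call per (model, method, task, sample) combination, which matches the stated total; the constant and function names are illustrative.

```python
from itertools import product

MODELS = ["GLM-4-flash", "DeepSeek-v3"]
METHODS = ["Truncate", "RAG top-k", "UCEF"]
TASKS = [
    "2wikimqa_e", "hotpotqa_e", "musique", "gov_report_e",
    "narrativeqa", "qasper_e", "passage_retrieval_en_e", "multifieldqa_en_e",
]
SAMPLES_PER_TASK = 30


def experiment_grid():
    """Yield one (model, method, task, sample_index) tuple per API call."""
    yield from product(MODELS, METHODS, TASKS, range(SAMPLES_PER_TASK))


runs = list(experiment_grid())
# Samples seen by a single method on a single model.
per_method_per_model = len(TASKS) * SAMPLES_PER_TASK
```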