Experiments

UCEF is evaluated in two ways: simulated experiments for controlled ablation studies, and real benchmark experiments for end-to-end validation.

Experimental Setup

Models

| Model | Context Window | API Provider |
|---|---|---|
| GLM-4-flash | 128K | Zhipu AI |
| DeepSeek-v3 | 128K | DeepSeek |

Benchmarks

We evaluate on 8 tasks from the LongBench benchmark suite:

| Task | Type | Description |
|---|---|---|
| 2wikimqa_e | Multi-hop QA | Wikipedia-based multi-hop reasoning |
| hotpotqa_e | Multi-hop QA | HotPotQA English subset |
| musique | Multi-step QA | Multi-step reasoning questions |
| gov_report_e | Summarization | Government report summarization |
| narrativeqa | Document QA | Narrative comprehension questions |
| qasper_e | Academic QA | Research paper question answering |
| passage_retrieval_en_e | Retrieval | Passage retrieval evaluation |
| multifieldqa_en_e | Multi-field QA | Cross-domain question answering |

Baselines

| Method | Description |
|---|---|
| Truncate | Simple truncation to the context budget (first N tokens) |
| RAG top-k | TF-IDF chunk scoring + top-k selection |
| UCEF | Full pipeline: hyperbolic scoring + quantum selection + adaptive compression |
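
The two non-UCEF baselines are simple enough to sketch directly. The code below is a minimal illustration, not the project's implementation: the function names (`truncate`, `chunk`, `tfidf_top_k`), the whitespace tokenizer, the fixed-size chunking, and the smoothed IDF formula are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens stand in for the model tokenizer.
    return text.lower().split()


def truncate(document, budget):
    """Truncate baseline: keep only the first `budget` tokens."""
    return " ".join(document.split()[:budget])


def chunk(document, chunk_size):
    """Split a document into consecutive chunks of `chunk_size` tokens."""
    tokens = document.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


def tfidf_top_k(query, chunks, k):
    """RAG top-k baseline: rank chunks by summed TF-IDF of the query terms."""
    tokenized = [tokenize(c) for c in chunks]
    n = len(chunks)
    # Document frequency: number of chunks that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))

    scores = []
    for toks in tokenized:
        if not toks:
            scores.append(0.0)
            continue
        tf = Counter(toks)
        # Smoothed IDF (assumed variant): log((1 + n) / (1 + df)) + 1.
        scores.append(sum(
            (tf[t] / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t in query_terms if t in tf
        ))

    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    # Return the selected chunks in their original document order.
    return [chunks[i] for i in sorted(top)]
```

Returning the selected chunks in document order is a design choice of this sketch; it keeps the retrieved context readable for the model, but ordering by score would be equally valid.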

Metrics

  • ROUGE-L: F1 score based on longest common subsequence
  • Token Overlap F1: F1 score over shared tokens, used as an approximate measure of semantic similarity
  • Latency: End-to-end pipeline time per query
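
As a rough illustration of how these three metrics can be computed, the sketch below implements ROUGE-L F1 from the longest common subsequence, a bag-of-tokens overlap F1, and a wall-clock timer. It assumes whitespace tokenization and no text normalization; the function names are hypothetical and the actual evaluation scripts may differ.

```python
import time
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def f1(match, pred_len, ref_len):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if match == 0:
        return 0.0
    precision, recall = match / pred_len, match / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_l_f1(prediction, reference):
    """ROUGE-L F1: order-sensitive, based on the LCS of the token sequences."""
    pred, ref = prediction.split(), reference.split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))


def token_overlap_f1(prediction, reference):
    """Token overlap F1: order-insensitive multiset overlap of tokens."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    return f1(overlap, len(pred), len(ref))


def timed(fn, *args, **kwargs):
    """Run `fn` and return (result, elapsed seconds) for latency reporting."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

The two F1 scores differ only in whether token order matters: for the prediction `"b a"` against the reference `"a b"`, token overlap F1 is 1.0 while ROUGE-L F1 is 0.5.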

Sample Size

  • 30 samples per task across 8 tasks (240 samples per method per model)
  • 1,440 total LLM API calls across all experiments (240 samples × 3 methods × 2 models)
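
The totals follow from the setup described above, assuming one LLM API call per sample, method, and model. The constant names below are illustrative only.

```python
SAMPLES_PER_TASK = 30
NUM_TASKS = 8      # LongBench tasks listed under Benchmarks
NUM_METHODS = 3    # Truncate, RAG top-k, UCEF
NUM_MODELS = 2     # GLM-4-flash, DeepSeek-v3

# Samples evaluated for each (method, model) pair.
per_method_per_model = SAMPLES_PER_TASK * NUM_TASKS

# Assumption: exactly one LLM API call per sample, per method, per model.
total_calls = per_method_per_model * NUM_METHODS * NUM_MODELS

print(per_method_per_model, total_calls)
```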