LongBench Results
Main Results
We evaluate UCEF on 8 LongBench tasks with 30 samples per task, using GLM-4-flash and DeepSeek-v3 as backbone models. The context budget is set to 4,000 tokens (simulating a small-context model processing long documents).
Overall Comparison
| Model |
Method |
Avg ROUGE-L |
Avg TokenF1 |
vs RAG |
| GLM-4-flash |
Truncate |
0.1433 |
0.1633 |
+6.9% |
| GLM-4-flash |
RAG top-k |
0.1340 |
0.1498 |
— |
| GLM-4-flash |
UCEF |
0.1479 |
0.1671 |
+10.3% |
| DeepSeek-v3 |
Truncate |
0.1889 |
0.2040 |
+5.0% |
| DeepSeek-v3 |
RAG top-k |
0.1800 |
0.1919 |
— |
| DeepSeek-v3 |
UCEF |
0.2146 |
0.2306 |
+19.3% |
Per-Task Breakdown (DeepSeek-v3)
| Task |
Truncate |
RAG |
UCEF |
Best |
| 2wikimqa_e |
0.2273 |
0.2095 |
0.2697 |
UCEF |
| hotpotqa_e |
0.4402 |
0.3660 |
0.4133 |
Trunc |
| musique |
0.0960 |
0.0941 |
0.1418 |
UCEF |
| gov_report_e |
0.0707 |
0.0545 |
0.0602 |
Trunc |
| narrativeqa |
0.0939 |
0.1751 |
0.2066 |
UCEF |
| qasper_e |
0.1521 |
0.0571 |
0.1615 |
UCEF |
| passage_retrieval_en_e |
0.0086 |
0.0086 |
0.0083 |
— |
| multifieldqa_en_e |
0.4225 |
0.4748 |
0.4558 |
RAG |
Per-Task Breakdown (GLM-4-flash)
| Task |
Truncate |
RAG |
UCEF |
Best |
| 2wikimqa_e |
0.2109 |
0.2062 |
0.2079 |
Trunc |
| hotpotqa_e |
0.2195 |
0.1796 |
0.2131 |
Trunc |
| musique |
0.0330 |
0.0336 |
0.0458 |
UCEF |
| gov_report_e |
0.0950 |
0.0819 |
0.0840 |
Trunc |
| narrativeqa |
0.0826 |
0.1369 |
0.1391 |
UCEF |
| qasper_e |
0.1068 |
0.0461 |
0.0949 |
Trunc |
| passage_retrieval_en_e |
0.0000 |
0.0000 |
0.0000 |
— |
| multifieldqa_en_e |
0.3989 |
0.3879 |
0.3979 |
Trunc |
Statistical Significance
Paired statistical tests comparing UCEF vs RAG across all 240 samples:
| Model |
Test |
p-value |
Significant? |
| GLM-4-flash |
Wilcoxon |
0.288 |
No |
| GLM-4-flash |
Paired t-test |
0.225 |
No |
| DeepSeek-v3 |
Wilcoxon |
0.011 |
✅ Yes |
| DeepSeek-v3 |
Paired t-test |
0.007 |
✅ Yes |
Key Observations
- UCEF consistently outperforms RAG on multi-hop QA tasks (2wikimqa, musique) where document inter-relationships matter
- Truncation is surprisingly strong on tasks where answer-relevant content appears early in documents
- UCEF excels on DeepSeek-v3 — the stronger backbone model amplifies the benefit of better context selection
- Passage retrieval scores near zero for all methods, suggesting this task requires exact-match retrieval not suited to any compression approach
Latency
| Component |
GLM-4-flash |
DeepSeek-v3 |
| Context processing |
~2ms |
~1ms |
| LLM generation |
~1,500ms |
~900ms |
| Total per query |
~2,000ms |
~1,100ms |