LongMemEval

The LongMemEval dataset consists of 500 human-curated question-answer pairs, with the answers embedded within a scalable set of user-assistant chat histories. The dataset is designed to test more than simple fact recall: many of its tasks require complex temporal reasoning.
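
To make the structure described above concrete, here is a minimal sketch of how one such item could be represented: a question-answer pair plus a dated chat history in which only some sessions contain the evidence. All names here (Turn, Session, Instance, build_history, has_answer) are hypothetical and chosen for illustration; they are not the dataset's actual schema or loader. The sketch also assumes that "scalable" means the history can be padded with unrelated filler sessions up to a target size.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List
import random


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str


@dataclass
class Session:
    session_date: date          # when the chat session took place
    turns: List[Turn]
    has_answer: bool = False    # hypothetical flag: session contains evidence


@dataclass
class Instance:
    question: str
    answer: str
    question_date: date         # temporal questions are asked relative to this
    history: List[Session] = field(default_factory=list)


def build_history(evidence: List[Session], fillers: List[Session],
                  n_sessions: int, seed: int = 0) -> List[Session]:
    """Mix evidence sessions with filler sessions, ordered by date.

    Assumption: the history is scaled by sampling unrelated filler
    sessions until it holds n_sessions in total (or fillers run out).
    """
    rng = random.Random(seed)
    n_fill = max(0, n_sessions - len(evidence))
    chosen = rng.sample(fillers, min(n_fill, len(fillers)))
    return sorted(evidence + chosen, key=lambda s: s.session_date)
```

Under these assumptions, the same question can be evaluated against short or long histories by changing only n_sessions, and a temporal question would be answered by comparing session_date values against question_date rather than by looking up a single fact.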