Research Portfolio

My research explores the fundamental mechanisms of reasoning in Large Language Models, with a particular focus on scientific problem-solving and the development of robust evaluation frameworks for AI reasoning capabilities.

Research Impact

- 19,609 physics problems in the dataset
- 1 ACL Findings publication
- World's largest physics reasoning dataset
- Open-source contributions

Published Research

PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems
Authors: Oshayer Siddique et al.
Venue: ACL Findings 2025 (The Association for Computational Linguistics)
Publication Date: December 2025
Topics: Novel Inference Techniques, Physics Reasoning, LLM Evaluation, Benchmark Dataset
This research introduces inference-time techniques designed to enhance Large Language Models' ability to reason through complex physics problems. The study presents the PhysicsEval dataset, the world's largest physics reasoning benchmark with 19,609 curated problems, and demonstrates measurable performance improvements without modifying the underlying model architectures. The work documents current model limitations and establishes a practical framework for building AI systems that reason more reliably about science, bridging the gap between pattern recognition and genuine logical reasoning.
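To make the idea of an inference-time technique concrete, here is a minimal sketch of one such approach: a generate-critique-revise loop that leaves the model weights untouched. The `generate` placeholder, the prompts, and the fixed round budget are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of an inference-time refinement loop for physics problems.
# `generate` stands in for any LLM completion call; the critic/reviser
# prompts below are hypothetical, not the method used in the paper.

def generate(prompt: str) -> str:
    """Placeholder for a chat-completion API call."""
    raise NotImplementedError

def solve_with_verification(problem: str, max_rounds: int = 3) -> str:
    solution = generate(f"Solve this physics problem step by step:\n{problem}")
    for _ in range(max_rounds):
        critique = generate(
            "Check each step of this solution for physical and algebraic "
            f"errors. Reply 'OK' if correct.\nProblem: {problem}\n"
            f"Solution: {solution}"
        )
        if critique.strip().upper().startswith("OK"):
            break  # the verifier found no errors; accept the solution
        # Otherwise, revise the solution using the critique as feedback.
        solution = generate(
            f"Revise the solution using this critique.\nProblem: {problem}\n"
            f"Solution: {solution}\nCritique: {critique}"
        )
    return solution
```

The key property is that all of the extra compute happens at inference time, so the same loop can wrap any model without retraining.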

Research Resources & Contributions

PhysicsEval Dataset: The World's Largest Physics Reasoning Benchmark
Repository: Hugging Face Datasets
Size: 19,609 physics problems with solutions
Availability: Open source, publicly accessible
Highlights: 19,609 Problems, Multiple Physics Domains, Verified Solutions, Community Resource
The PhysicsEval dataset is, to our knowledge, the most comprehensive physics reasoning benchmark available for AI evaluation. Sourced from diverse physics textbooks and cross-checked against educational forums, it challenges models to apply fundamental principles and logical reasoning rather than rely on pattern matching. The problems span multiple physics domains, including mechanics, thermodynamics, electromagnetism, and quantum physics. Each problem comes with a detailed solution and a step-by-step reasoning path, making the dataset useful for both evaluation and training.
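Because the dataset is distributed through Hugging Face, it can be loaded with the standard `datasets` library. The repository ID, split name, and field layout below are placeholders; consult the dataset card on Hugging Face for the actual identifiers.

```python
# Load PhysicsEval via the Hugging Face `datasets` library.
# NOTE: "your-org/PhysicsEval" is a placeholder repository ID, and the
# split/field names are assumptions; check the dataset card for real ones.
from datasets import load_dataset

ds = load_dataset("your-org/PhysicsEval")  # placeholder repo ID
example = ds["train"][0]                   # assumes a "train" split exists
print(example)  # expected: a problem statement plus a worked solution
```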

Current Research Direction

Investigating the Core Mechanisms of LLM Reasoning
Status: Ongoing Research
Focus: Fundamental reasoning components in Large Language Models
Expected Outcome: Novel insights into AI reasoning architecture
Topics: Reasoning Analysis, Model Interpretability, Cognitive Architecture, AI Transparency
My current research addresses a fundamental question in artificial intelligence: what constitutes the core of reasoning in Large Language Models? The investigation aims to deconstruct the specific components and mechanisms within LLMs that enable their logical and problem-solving capabilities; understanding these processes is a prerequisite for building more transparent, reliable, and effective AI systems. The work involves systematic analysis of attention patterns, layer-wise development of reasoning, and the emergence of logical structure during training, with the goal of clarifying how these systems acquire and apply reasoning abilities and of informing more interpretable AI.
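For a sense of the tooling this kind of analysis rests on, the sketch below extracts per-layer attention maps and hidden states with the Hugging Face `transformers` library. The `gpt2` checkpoint and the toy physics sentence are placeholder choices; what to measure in these tensors is the open research question.

```python
# Sketch: pulling per-layer attention patterns and hidden states for
# interpretability analysis. `gpt2` is only an example model.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained(
    "gpt2", output_attentions=True, output_hidden_states=True
)
model.eval()

inputs = tokenizer(
    "If F = ma with m = 2 kg and a = 3 m/s^2, then F = 6 N.",
    return_tensors="pt",
)
with torch.no_grad():
    out = model(**inputs)

# out.attentions: one (batch, heads, seq, seq) tensor per layer.
# out.hidden_states: one (batch, seq, hidden) tensor per layer (plus embeddings).
print(len(out.attentions), out.attentions[0].shape)
```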

Research Areas

Physics-Informed AI
Developing AI systems that understand and apply fundamental physics principles, bridging the gap between symbolic reasoning and neural computation.
LLM Reasoning Mechanisms
Investigating the internal processes that enable reasoning in large language models, focusing on interpretability and cognitive architectures.
Scientific Problem Solving
Creating robust evaluation frameworks and methodologies for assessing AI performance on complex scientific and mathematical problems.
Benchmark Development
Designing comprehensive datasets and evaluation metrics that push the boundaries of current AI capabilities and reveal fundamental limitations (a minimal scoring sketch follows below).
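As a hypothetical illustration of the evaluation side of this work, the sketch below scores a model's numeric answers against reference values within a relative tolerance. The `solve` callable, the field names, and the tolerance are assumptions for illustration, not any benchmark's official metric.

```python
# Minimal sketch of a numeric-answer evaluation loop. The tolerance-based
# comparison and the `solve` callable are illustrative assumptions.
import math

def score(problems, solve, rel_tol=1e-2):
    """Fraction of problems where the model's numeric answer matches
    the reference answer within a relative tolerance."""
    correct = 0
    for p in problems:
        try:
            predicted = float(solve(p["question"]))
        except (ValueError, TypeError):
            continue  # unparseable answers count as wrong
        if math.isclose(predicted, p["answer"], rel_tol=rel_tol):
            correct += 1
    return correct / len(problems)
```

In practice, benchmark metrics also need to handle units, symbolic answers, and multi-part problems, which is part of what makes careful benchmark design a research problem in its own right.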