DOT Data Labs
Dataset

ScienceQA-32M

32M science Q&A pairs across physics, chemistry, biology, math, and earth science, verified for LLM training.

Pairs: 32,000,000
Subjects: Physics, Chemistry, Biology, Math, Earth Science
Human-authored: 60%+ of solutions
Languages: EN, ZH (partial)

ScienceQA-32M contains thirty-two million curated science question-answer pairs sourced from K-12 and undergraduate study platforms. Each pair includes the original question, a step-by-step worked solution, the final answer, and subject/topic metadata. It is used to fine-tune LLMs for STEM tutoring, reasoning, and answer-grounded generation.

Tags

Q&A, Science, STEM, LLM Training, Reasoning

Delivery formats

  • JSONL
  • Parquet
  • HuggingFace dataset format
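For the JSONL delivery, records can be streamed one line at a time rather than loaded into memory at once. The following is a minimal sketch using only the Python standard library. The helper name `iter_records` and the second record (`sci_0000001`) are hypothetical and exist only for illustration; an in-memory buffer stands in for a real file handle, and the actual file layout is an assumption.

```python
import io
import json

def iter_records(lines):
    """Yield one dict per non-empty JSONL line (hypothetical helper)."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: invalid JSON ({exc})") from exc

# In-memory stand-in for an open JSONL file; the second record is made up.
sample = io.StringIO(
    '{"id": "sci_8240412", "subject": "Physics", "topic": "Kinematics"}\n'
    '\n'
    '{"id": "sci_0000001", "subject": "Chemistry", "topic": "Stoichiometry"}\n'
)

records = list(iter_records(sample))
subjects = [r["subject"] for r in records]
```

With a real delivery, the `io.StringIO` buffer would be replaced by an open file object, which iterates line by line in the same way.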

License

Commercial AI training license, perpetual.

Data sample

What a record looks like

Sample Q&A pair

JSON (illustrative; full sample available under NDA)
{
  "id": "sci_8240412",
  "subject": "Physics",
  "topic": "Kinematics",
  "question": "A ball is dropped from 45 m. How long until it hits the ground?",
  "solution": "Use h = 1/2 g t^2. Solve for t: t = sqrt(2h/g) = sqrt(90/9.8) ≈ 3.03 s.",
  "answer": "≈ 3.03 seconds"
}
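
A record shaped like the sample above could be sanity-checked before training, for example by confirming that every field is present and by recomputing the numeric answer. This is a sketch under assumptions: the field list is taken from the illustrative sample and may not match the full schema, and `missing_fields` is a hypothetical helper, not part of any delivered tooling.

```python
import json
import math

# Field names copied from the illustrative sample; the real schema may differ.
REQUIRED_FIELDS = ("id", "subject", "topic", "question", "solution", "answer")

def missing_fields(record):
    """Return required fields that are absent, non-string, or blank."""
    return [
        name for name in REQUIRED_FIELDS
        if not isinstance(record.get(name), str) or not record[name].strip()
    ]

raw = """{
  "id": "sci_8240412",
  "subject": "Physics",
  "topic": "Kinematics",
  "question": "A ball is dropped from 45 m. How long until it hits the ground?",
  "solution": "Use h = 1/2 g t^2. Solve for t: t = sqrt(2h/g) = sqrt(90/9.8) ≈ 3.03 s.",
  "answer": "≈ 3.03 seconds"
}"""

record = json.loads(raw)
problems = missing_fields(record)

# Recompute the worked solution: t = sqrt(2h / g) with h = 45 m, g = 9.8 m/s^2.
fall_time = math.sqrt(2 * 45 / 9.8)
```

Here `problems` is empty for the sample record, and `fall_time` rounds to the 3.03 s given in the solution.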