Easylink Case Study — DOT Data Labs

32M Science Q&A Dataset for LLM Training

How we helped Zonekee Easylink build a massive, clean, ready-to-train dataset of real science questions and detailed solutions — delivered in under a month.

Client: Zonekee Easylink Co., Ltd.
Industry: AI / EdTech
Duration: < 1 month
Services: Data Collection & Processing
Questions & Answers Delivered: 32M
From Start to Delivery: <30 days
Human-Answered Content: 60%+
Client Rating: 5.0

Challenge

Zonekee Easylink Co., Ltd. was building a language model capable of understanding and solving science problems across a broad range of school-level disciplines — mathematics, physics, chemistry, biology, and more. To train their model effectively, they needed a dataset that was not only large in volume but rich in reasoning: each answer had to include a detailed, step-by-step solution that would teach the model how to think through a problem, not just output a final answer.

The scale was ambitious. They needed tens of millions of question-answer pairs — enough to cover the breadth of topics students encounter from elementary through high school science curricula. The data had to come from authentic sources where real students ask real questions and real tutors provide thorough explanations. Synthetic-only datasets wouldn't cut it: the model needed to learn from genuine human reasoning patterns.

On top of the quality requirements, there was a hard timeline constraint. The client's model training schedule was already set, and the dataset needed to be delivered in under a month — cleaned, structured, and ready to plug directly into their pipeline with zero additional preprocessing.

Our Approach

We broke the project into four distinct phases, running several of them in parallel to meet the aggressive timeline.

Phase 1

Source Identification & Infrastructure

We identified the largest Q&A platform for school students as our primary data source — a platform where millions of students post science questions and receive detailed answers from verified tutors. We built a custom scraping infrastructure using Python, Selenium for dynamic page rendering, and direct HTTP requests for static endpoints, optimizing for throughput while respecting rate limits.
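To illustrate the split between the two fetch paths, here is a minimal sketch. The rate-limit delay, browser options, and helper names are placeholders rather than the production configuration:

    import time
    import requests
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    REQUEST_DELAY = 0.5  # seconds between requests; illustrative rate limit

    def fetch_static(url: str) -> str:
        """Fetch a static endpoint with a plain HTTP request (fast path)."""
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        return resp.text

    def fetch_dynamic(url: str) -> str:
        """Render a JavaScript-heavy page with headless Chrome (slow path)."""
        opts = Options()
        opts.add_argument("--headless=new")
        driver = webdriver.Chrome(options=opts)
        try:
            driver.get(url)
            return driver.page_source
        finally:
            driver.quit()

    def fetch(url: str, dynamic: bool = False) -> str:
        """Pick the cheapest fetch path that works for the page, then throttle."""
        html = fetch_dynamic(url) if dynamic else fetch_static(url)
        time.sleep(REQUEST_DELAY)  # stay under the platform's rate limits
        return html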

Phase 2

Large-Scale Collection

We deployed distributed collection workers to extract questions, answers, subject tags, difficulty levels, and solution explanations across all science categories — math, physics, chemistry, biology, geography, and more. The pipeline was designed to handle the platform's structure and extract both the question content and the full solution chain.
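A simplified sketch of one collection worker is below. The CSS selectors and field names are illustrative placeholders (the real ones depend on the source platform's markup), fetch() is the helper from the Phase 1 sketch, and the production workers ran across multiple machines with richer error handling:

    import json
    from queue import Queue
    from bs4 import BeautifulSoup

    def extract_fields(html: str, url: str) -> dict:
        """Pull the fields kept for every entry; selectors here are placeholders."""
        soup = BeautifulSoup(html, "html.parser")
        return {
            "question": soup.select_one(".question-body").get_text(" ", strip=True),
            "answer": soup.select_one(".answer-body").get_text(" ", strip=True),
            "subject": soup.select_one(".subject-tag").get_text(strip=True),
            "difficulty": soup.select_one(".difficulty").get_text(strip=True),
            "source_url": url,
        }

    def worker(url_queue: Queue, out_path: str) -> None:
        """One collection worker: pull URLs, extract fields, append JSON lines."""
        with open(out_path, "a", encoding="utf-8") as out:
            while (url := url_queue.get()) is not None:   # None is the shutdown signal
                html = fetch(url, dynamic=True)           # fetch() from the Phase 1 sketch
                record = extract_fields(html, url)
                out.write(json.dumps(record, ensure_ascii=False) + "\n")
                url_queue.task_done()
            url_queue.task_done()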

Phase 3

AI-Augmented Solutions

Not every question on the platform had a complete, detailed solution. For the 30–40% of entries that lacked thorough step-by-step explanations, we used Google's Gemini to generate comprehensive solutions — carefully prompted to match the style, depth, and formatting of the human-written answers already in the dataset.
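The generation step looked roughly like the sketch below, using the google-generativeai client. The model name and prompt wording are illustrative, not the exact production prompt:

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")          # key management omitted
    model = genai.GenerativeModel("gemini-1.5-pro")  # model name is illustrative

    PROMPT_TEMPLATE = """You are a patient science tutor. Solve the question below with a
    numbered, step-by-step explanation written in the same style and depth as the
    example solution, then state the final answer on its own line.

    Example solution (human-written, for style only):
    {example}

    Question:
    {question}"""

    def generate_solution(question: str, example: str) -> str:
        """Ask Gemini for a step-by-step solution matching the human style."""
        prompt = PROMPT_TEMPLATE.format(example=example, question=question)
        response = model.generate_content(prompt)
        return response.text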

Data Quality & Cleaning

Raw data at this scale is inevitably noisy. We built a multi-stage cleaning pipeline powered by Gemini to ensure every entry in the final dataset met our quality standards:

  • Deduplication — Identified and removed duplicate questions using both exact matching and semantic similarity, eliminating redundant entries that would bias model training (a toy sketch of the two passes follows this list)
  • Format standardization — Normalized mathematical notation, chemical formulas, and scientific terminology across all 32 million entries to ensure consistent formatting
  • Answer validation — Used Gemini to verify that solutions were logically coherent, that step-by-step reasoning was sound, and that final answers matched the solution process
  • Noise removal — Stripped irrelevant metadata, broken HTML fragments, advertisement remnants, and platform-specific artifacts from every entry
  • Subject classification — Validated and corrected subject tags to ensure questions were properly categorized across disciplines
  • Language quality — Flagged and corrected entries with grammatical errors, incomplete sentences, or unclear phrasing in both questions and answers
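As an example of the deduplication bullet above, here is a toy sketch of the two passes. Brute-force pairwise similarity is only workable on small batches; at 32 million rows the inner loop would be replaced by an approximate-nearest-neighbour index. The embedding model named here is an assumption:

    import hashlib
    import numpy as np
    from sentence_transformers import SentenceTransformer

    def _normalize(text: str) -> str:
        return " ".join(text.lower().split())

    def dedup(questions: list[str], sim_threshold: float = 0.95) -> list[int]:
        """Return indices of questions to keep after exact and semantic dedup."""
        # Pass 1: exact duplicates via hashing of normalized text.
        seen, candidates = set(), []
        for i, q in enumerate(questions):
            digest = hashlib.sha1(_normalize(q).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                candidates.append(i)

        # Pass 2: near-duplicates via cosine similarity of sentence embeddings.
        model = SentenceTransformer("all-MiniLM-L6-v2")  # embedding model is an assumption
        embeddings = model.encode([questions[i] for i in candidates],
                                  normalize_embeddings=True)
        kept, kept_embs = [], []
        for idx, emb in zip(candidates, embeddings):
            if kept_embs and float(np.max(np.stack(kept_embs) @ emb)) >= sim_threshold:
                continue  # too close to an entry we already kept
            kept.append(idx)
            kept_embs.append(emb)
        return kept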

Solution

The final deliverable was a structured dataset of 32 million science Q&A pairs — one of the largest datasets of its kind on the market. Each entry contained the original question, a complete step-by-step solution, the final answer, and subject/topic metadata.
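An illustrative record is shown below; the field names and the example question are hypothetical, and the delivered schema followed the client's specification:

    import json

    # Field names here are an assumption for illustration only.
    record = {
        "question": "A 2 kg cart accelerates at 3 m/s^2. What net force acts on it?",
        "solution_steps": [
            "Newton's second law relates force, mass and acceleration: F = m * a.",
            "Substitute the given values: F = 2 kg * 3 m/s^2.",
            "Multiply: F = 6 N.",
        ],
        "final_answer": "6 N",
        "subject": "physics",
        "topic": "classical mechanics",
        "solution_source": "human",   # or "ai" for Gemini-generated solutions
    }

    with open("dataset.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")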

Over 60% of the solutions came from real human tutors, preserving authentic reasoning patterns and natural explanation styles. The remaining 30–40% were generated by Gemini, with prompts specifically tuned to produce solutions indistinguishable from the human-written content in style and depth. The AI-generated solutions went through the same validation pipeline as the human ones.

The dataset covered the full spectrum of school-level science: algebra, geometry, calculus, classical mechanics, thermodynamics, optics, organic and inorganic chemistry, cell biology, genetics, ecology, earth science, and more. This breadth ensures the client's model can handle questions across any science discipline students typically encounter.

Everything was delivered in a clean, standardized format ready for direct ingestion into the client's training pipeline — no additional processing, transformation, or cleanup required on their end.

★★★★★
"Oleg is very intelligent and super professional, he fully considered the requirements before the task started, delivered the finished work to me in a very quick time and even if there were revisions, they were sent to me in a very quick time. It was very heart saving cooperation. The ability to maintain good communication coupled with fast delivery was perfect."
Zonekee Easylink Co., Ltd. — via Upwork (5.0 rating)

Key Challenges

A project of this scale came with several challenges that required creative problem-solving:

  • Scale vs. speed tradeoff — Collecting 32 million entries in under 30 days required a distributed architecture with multiple concurrent workers, careful rate management, and automatic retry logic for failed requests (a retry sketch follows this list). We optimized the pipeline continuously throughout the project, increasing throughput by 3x over the first week alone.
  • Heterogeneous content formats — Science questions often contain mathematical equations, chemical formulas, diagram references, and mixed notation. We built custom parsers to handle LaTeX, Unicode math symbols, and chemical notation, converting everything into a consistent format (a normalization sketch also follows this list).
  • AI-human quality parity — The AI-generated solutions needed to be indistinguishable from human ones. We went through several prompt iterations with Gemini, benchmarking generated solutions against human baselines until the quality gap was negligible.
  • Validation at scale — Manually reviewing even 1% of 32 million entries would mean checking 320,000 items. We built automated validation using Gemini to check logical consistency, completeness, and correctness — then manually spot-checked random samples across subjects to verify the automated checks were working.
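For the retry logic mentioned in the first challenge, the behaviour can be sketched as follows; attempt counts and backoff constants are illustrative:

    import random
    import time
    import requests

    def fetch_with_retry(url: str, max_attempts: int = 5) -> str:
        """GET a URL, retrying failed requests with exponential backoff and jitter."""
        for attempt in range(max_attempts):
            try:
                resp = requests.get(url, timeout=30)
                resp.raise_for_status()
                return resp.text
            except requests.RequestException:
                if attempt == max_attempts - 1:
                    raise
                # Back off exponentially (1 s, 2 s, 4 s, ...) plus jitter so that
                # concurrent workers do not retry in lockstep.
                time.sleep(2 ** attempt + random.random())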
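
And for the notation challenge, a heavily trimmed sketch of the normalization idea; the real mapping tables covered far more symbols plus LaTeX fragments and chemical formulas:

    import unicodedata

    # Tiny illustrative mapping, not the production table.
    UNICODE_TO_LATEX = {
        "×": r"\times", "÷": r"\div", "±": r"\pm",
        "≤": r"\leq", "≥": r"\geq", "≠": r"\neq",
        "²": "^{2}", "³": "^{3}", "√": r"\sqrt",
        "π": r"\pi", "°": r"^{\circ}",
    }

    def normalize_math(text: str) -> str:
        """Rewrite common Unicode math symbols as LaTeX so entries share one notation."""
        for symbol, latex in UNICODE_TO_LATEX.items():
            text = text.replace(symbol, latex)
        # Apply NFKC last, so superscripts like ² are mapped above rather than
        # being flattened to plain digits first.
        return unicodedata.normalize("NFKC", text)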

Results

The project was completed in under one month from initial briefing to final delivery. The client received one of the largest science Q&A datasets available — clean, structured, and ready for immediate use in their LLM training pipeline.

  • 32 million question-answer pairs collected, cleaned, and delivered
  • 60%+ human-authored content ensuring authentic reasoning patterns
  • All major science disciplines covered — math, physics, chemistry, biology, earth science, and more
  • Sub-30-day delivery from project kickoff to final handoff
  • Zero preprocessing needed — dataset was plug-and-play for the client's training pipeline
  • One of the largest science Q&A datasets on the market
  • 5.0 star rating and outstanding client feedback