Easylink Case Study — DOT Data Labs

1M+ Commercial Real Estate Listings for AI-Powered Market Analysis

How we built a large-scale extraction pipeline across 30+ CRE platforms to fuel a leading firm's AI market analysis model — with ongoing data delivery.

Client: Under NDA
Industry: Commercial Real Estate
Duration: 3+ months (ongoing)
Services: Web Scraping & Data Engineering
1M+ Property Listings Extracted
30+ CRE Platforms Covered
Full Listing Data Fields
Ongoing Data Delivery Pipeline

Challenge

A leading commercial real estate firm needed a comprehensive property listing dataset to train an AI model for market analysis — predicting trends, identifying undervalued assets, and surfacing investment opportunities across the U.S. market. The model required rich, structured data from as many sources as possible to capture a complete picture of the CRE landscape.

The client's requirements were demanding: full listing data (pricing, square footage, cap rates, NOI, location, property type, tenant information, photos, descriptions, zoning data) extracted from 30+ major CRE platforms — each with its own site structure, data format, and anti-scraping defenses. The initial delivery needed to include over 1 million property records, with an ongoing pipeline to keep the dataset fresh as new listings appeared and existing ones changed.

The core challenge was threefold: building and maintaining reliable scrapers across 30+ platforms that each evolved independently, normalizing wildly different data formats into a single consistent schema, and deduplicating properties that appeared on multiple platforms — all while scaling to handle over a million records and keeping the pipeline running continuously.

Our Approach

Phase 1

Multi-Platform Extraction

Built a custom scraping pipeline for each of the 30+ CRE platforms, tailored to that source's structure and anti-bot measures. The infrastructure used rotating proxies, intelligent request throttling, and headless browsers where needed to ensure reliable, uninterrupted data collection at scale.
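
As a rough illustration of what a single extraction worker in such a pipeline can look like, the sketch below rotates requests across a proxy pool and paces them with randomized delays. The proxy endpoints, user-agent string, and function name are hypothetical placeholders, not the production implementation.

```python
import random
import time

import requests

# Hypothetical proxy endpoints -- placeholders, not production credentials.
PROXY_POOL = [
    "http://user:pass@proxy-1.example.com:8000",
    "http://user:pass@proxy-2.example.com:8000",
]


def fetch_listing_page(url: str, min_delay: float = 1.5, max_delay: float = 4.0) -> str:
    """Fetch one listing page through a randomly rotated proxy, with pacing and retries."""
    last_error = None
    for _attempt in range(3):
        proxy = random.choice(PROXY_POOL)
        # Randomized delay keeps per-IP request rates under platform thresholds.
        time.sleep(random.uniform(min_delay, max_delay))
        try:
            response = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                headers={"User-Agent": "Mozilla/5.0 (compatible; cre-pipeline)"},
                timeout=30,
            )
            if response.status_code == 200:
                return response.text
            last_error = f"HTTP {response.status_code}"
        except requests.RequestException as exc:
            last_error = str(exc)
    raise RuntimeError(f"failed to fetch {url}: {last_error}")
```

Platforms that render listings client-side would be fetched through a headless browser (for example Playwright) behind the same pacing and rotation logic.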

Phase 2

Normalization & Deduplication

Developed a unified data schema and built automated normalization pipelines to map each platform's fields into a single consistent format. Cross-platform deduplication matched the same properties across sources using address matching, geolocation, and property attributes to produce a clean, deduplicated master dataset.
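
The shape of that normalization layer can be sketched as a set of small, platform-specific parsers feeding one canonical record format. The raw field names (listingId, buildingSize, capRate) and the target field names below are illustrative assumptions, not the actual schema.

```python
import re
from typing import Any, Dict, Optional


def parse_square_feet(raw: Optional[str]) -> Optional[float]:
    """Extract a numeric value from strings like '12,500 SF' or '12500 sqft'."""
    if not raw:
        return None
    match = re.search(r"([\d,.]+)\s*(sf|sq\.?\s*ft|square feet)", raw, re.IGNORECASE)
    return float(match.group(1).replace(",", "")) if match else None


def parse_cap_rate(raw: Optional[str]) -> Optional[float]:
    """Normalize '6.25%' or '0.0625' into a decimal cap rate."""
    if not raw:
        return None
    try:
        value = float(raw.replace("%", "").strip())
    except ValueError:
        return None
    return value / 100 if "%" in raw or value > 1 else value


def normalize_platform_a(record: Dict[str, Any]) -> Dict[str, Any]:
    """Map one (hypothetical) platform's raw listing fields into the unified schema."""
    return {
        "source": "platform_a",
        "source_id": record.get("listingId"),
        "address": (record.get("address") or "").strip(),
        "asking_price": record.get("price"),
        "square_feet": parse_square_feet(record.get("buildingSize")),
        "cap_rate": parse_cap_rate(record.get("capRate")),
        "property_type": (record.get("type") or "").lower() or None,
        "latitude": record.get("lat"),
        "longitude": record.get("lng"),
    }
```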

Phase 3

Ongoing Delivery & Monitoring

Set up a continuous delivery pipeline that regularly refreshed the dataset with new listings, price changes, and status updates. Built monitoring and alerting for scraper health across all 30+ sources — automatically detecting when a platform changed its structure and flagging scrapers that needed updates.
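
One way to picture the health monitoring is as a small check that compares each run's yield against an expected baseline. The dataclass fields, thresholds, and alert wording below are assumptions for illustration, not the monitoring system itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScrapeRun:
    """Summary of one scraper run; fields and thresholds here are illustrative."""
    platform: str
    records_extracted: int
    expected_min_records: int
    required_field_fill_rate: float  # share of records with price, address, sqft populated


def health_alerts(runs: List[ScrapeRun], min_fill_rate: float = 0.8) -> List[str]:
    """Flag scrapers whose output suggests a layout change, a block, or broken selectors."""
    alerts = []
    for run in runs:
        if run.records_extracted < run.expected_min_records:
            alerts.append(
                f"{run.platform}: {run.records_extracted} records "
                f"(expected >= {run.expected_min_records}); possible block or pagination change"
            )
        elif run.required_field_fill_rate < min_fill_rate:
            alerts.append(
                f"{run.platform}: fill rate {run.required_field_fill_rate:.0%}; "
                f"selectors may be pointing at a changed layout"
            )
    return alerts
```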

Solution

We delivered an initial dataset of over 1 million commercial property listings — fully normalized, deduplicated, and enriched with complete listing data from 30+ CRE platforms. Every record included financial metrics (asking price, cap rate, NOI, price per sqft), physical attributes (square footage, lot size, year built, property type), location data (address, coordinates, market area), tenant and lease information where available, listing descriptions, and photos.
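
For a sense of what one unified record can look like, here is a compressed sketch. The field names and types are assumptions chosen to mirror the categories above, not the schema actually delivered.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PropertyListing:
    """Illustrative unified record; field names are assumptions, not the delivered schema."""
    property_id: str                                   # stable ID assigned after deduplication
    source_platforms: List[str] = field(default_factory=list)
    # Financial metrics
    asking_price: Optional[float] = None
    cap_rate: Optional[float] = None                   # stored as a decimal, e.g. 0.0625
    noi: Optional[float] = None
    price_per_sqft: Optional[float] = None
    # Physical attributes
    square_feet: Optional[float] = None
    lot_size_acres: Optional[float] = None
    year_built: Optional[int] = None
    property_type: Optional[str] = None
    # Location
    address: Optional[str] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    market_area: Optional[str] = None
    # Listing content
    description: Optional[str] = None
    photo_urls: List[str] = field(default_factory=list)
    tenant_info: Optional[str] = None
    zoning: Optional[str] = None
    listing_status: Optional[str] = None               # active / pending / sold / withdrawn
```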

All data was mapped into a unified schema regardless of source platform, meaning the client's AI pipeline could ingest the entire dataset without any additional preprocessing or field mapping. Cross-platform deduplication ensured each unique property appeared only once in the master dataset, with the most complete and current information aggregated from all sources where that property was listed.

Beyond the initial delivery, we established a continuous update pipeline. New listings, price changes, and status updates (sold, pending, withdrawn) were captured and delivered on a regular schedule, keeping the client's training data current and enabling their model to learn from the latest market dynamics rather than stale historical snapshots.

Key Challenges

  • Anti-bot defenses across 30+ platforms — Each CRE platform employed different anti-scraping measures: CAPTCHAs, IP rate limiting, JavaScript rendering requirements, session fingerprinting, and behavioral analysis. We built platform-specific strategies using rotating residential proxies, request pacing, headless browser rendering, and session management to maintain reliable access without triggering blocks.
  • Data normalization at scale — Every platform structured its data differently. The same concept — like "cap rate" — might be labeled, formatted, and calculated differently across sources. Some platforms listed square footage as "SF," others as "sqft" or embedded it in descriptions. We built a normalization layer with platform-specific parsers that mapped each source's idiosyncratic format into a single clean schema with consistent units, naming, and data types.
  • Cross-platform deduplication — The same property often appeared on 5–10 platforms simultaneously, each with slightly different data. Simple address matching wasn't enough — listings used different address formats, sometimes included suite numbers, and occasionally had typos. We built a fuzzy matching pipeline combining normalized address comparison, geolocation proximity, and property attribute similarity to identify and merge duplicate listings while preserving the most complete data from each source. A simplified version of this matching logic is sketched after this list.
  • Maintaining 30+ scrapers — CRE platforms regularly updated their site structures, broke existing selectors, added new anti-bot measures, or reorganized their data layouts. Managing 30+ independent scrapers meant building automated health monitoring that detected extraction failures early, alerting our team to fix affected scrapers before data gaps appeared. This was an ongoing operational challenge, not a one-time engineering problem.
  • Data quality and completeness — Many listings had incomplete information — missing cap rates, no NOI figures, vague descriptions, or outdated pricing. We built quality scoring for each record and implemented cross-source enrichment: when one platform had a field that another was missing, we merged the data to create the most complete record possible for each property. A minimal merge sketch also follows this list.
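
To make the matching heuristic concrete, the sketch below combines normalized-address similarity, geolocation proximity, and a square-footage sanity check. The thresholds, field names, and similarity measure are illustrative assumptions; a pipeline at this scale would also need candidate blocking (for example grouping by zip code) rather than all-pairs comparison.

```python
import math
import re
from difflib import SequenceMatcher
from typing import Any, Dict


def normalize_address(raw: str) -> str:
    """Collapse case, punctuation, suite numbers, and common abbreviations."""
    addr = raw.upper()
    addr = re.sub(r"\b(SUITE|STE|UNIT)\b\s*\S+|#\s*\S+", "", addr)
    for long_form, short_form in {"STREET": "ST", "AVENUE": "AVE", "BOULEVARD": "BLVD",
                                  "ROAD": "RD", "DRIVE": "DR"}.items():
        addr = re.sub(rf"\b{long_form}\b", short_form, addr)
    return re.sub(r"\s+", " ", re.sub(r"[^A-Z0-9 ]", "", addr)).strip()


def distance_meters(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Approximate great-circle (haversine) distance between two coordinates."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = math.radians(lat2 - lat1), math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6_371_000 * math.asin(math.sqrt(a))


def likely_same_property(a: Dict[str, Any], b: Dict[str, Any],
                         min_addr_sim: float = 0.90, max_meters: float = 50.0) -> bool:
    """Heuristic duplicate test: address similarity, geo proximity, attribute agreement."""
    addr_sim = SequenceMatcher(None, normalize_address(a["address"]),
                               normalize_address(b["address"])).ratio()
    if addr_sim < min_addr_sim:
        return False
    coords = (a.get("latitude"), a.get("longitude"), b.get("latitude"), b.get("longitude"))
    if None not in coords and distance_meters(*coords) > max_meters:
        return False
    sqft_a, sqft_b = a.get("square_feet"), b.get("square_feet")
    if sqft_a and sqft_b and abs(sqft_a - sqft_b) / max(sqft_a, sqft_b) > 0.10:
        return False  # same address but very different size: likely separate units or parcels
    return True
```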
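
And the enrichment step, in its simplest form, is a field-level merge across the listings matched as the same property. The "last_seen" recency key and the non-empty-value rule below are assumptions for illustration.

```python
from typing import Any, Dict, List


def merge_duplicates(listings: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Combine listings matched as the same property, keeping the freshest non-empty value
    per field. The 'last_seen' key and field conventions are illustrative assumptions."""
    merged: Dict[str, Any] = {}
    # Sort oldest -> newest so the most recently observed value overwrites earlier ones.
    for listing in sorted(listings, key=lambda r: r.get("last_seen", "")):
        for key, value in listing.items():
            if value not in (None, "", []):
                merged[key] = value
    merged["source_platforms"] = sorted({r["source"] for r in listings if r.get("source")})
    return merged
```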

Results

  • 1,000,000+ commercial property listings extracted and processed
  • 30+ CRE platforms covered with custom scraping pipelines
  • Full listing data — financials, location, descriptions, photos, tenant info, zoning
  • Unified schema — all sources normalized into a single consistent format
  • Cross-platform deduplication — fuzzy matching eliminated duplicate listings across sources
  • Continuous delivery — ongoing pipeline keeping the dataset current with new listings and updates
  • Training-ready — structured for direct ingestion into the client's AI market analysis model