Curated vs. Scraped: Why Your Robot's Vocabulary Matters

2026-03-07 · Nicolette Rankin

A hospital robot greets an elderly Japanese patient with casual slang pulled from a gaming forum. A customer service agent translates a product description into Arabic but uses a dialect that carries unintended political connotations. A children's education app teaches a six-year-old a French word using an example sentence scraped from an adult novel.

These are not hypothetical failures. They are the predictable outcome of building AI language systems on top of web-scraped data.

The Scraping Problem

The default approach to multilingual AI is straightforward: crawl the web, extract bilingual text pairs, run them through a cleaning pipeline, and serve the results through an API. It is fast, cheap, and scales to hundreds of languages. It is also fundamentally unsuitable for any application where accuracy, safety, or cultural sensitivity matters.
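The cleaning step in that pipeline is typically a stack of cheap surface heuristics. The sketch below uses two common ones, an empty-field filter and a length-ratio filter; both are illustrative stand-ins, not any specific vendor's pipeline. Note what they cannot catch: a pair that is fluent but simply wrong.

```python
def clean_pairs(pairs):
    """Keep (source, target) pairs that pass cheap surface heuristics."""
    cleaned = []
    for src, tgt in pairs:
        if not src or not tgt:                # drop pairs with an empty side
            continue
        ratio = len(src) / len(tgt)
        if 0.5 <= ratio <= 2.0:               # length-ratio filter
            cleaned.append((src, tgt))
    return cleaned

# A semantically wrong pair sails through; only the empty pair is caught.
pairs = [
    ("estoy embarazada", "I am embarrassed"),  # false cognate: kept anyway
    ("hola", ""),                              # empty target: dropped
]
cleaned = clean_pairs(pairs)
```

No amount of length-ratio tuning flags the first pair, because the error is semantic, not structural.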

1. Accuracy Errors

Scraped data inherits every mistake on the internet. Misspellings on product pages become dictionary entries. Machine-translated subtitles become reference translations.

Scraped result: the Spanish word "embarazada" offered as a translation of "embarrassed." Actual meaning: "pregnant." It is one of the best-documented false cognates in Spanish-English translation, and it still appears in scraped datasets.

Scraped result: Japanese honorific system reduced to a single "formal/informal" toggle. Curated result: Five levels of politeness with context rules for age, social status, business relationships, and regional variation.
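The difference shows up directly in the data model. A curated entry can carry explicit politeness levels plus context rules for choosing among them; the sketch below is a deliberately simplified three-level version, and its field names and selection rules are illustrative, not Word Orb's actual schema.

```python
THANKS = {
    "lemma": "ありがとう",          # "thank you"
    "levels": {
        "casual": "ありがとう",
        "polite": "ありがとうございます",
        "humble": "誠にありがとうございます",
    },
}

def select_politeness(entry, listener_age_delta, business_context):
    """Pick a politeness level from context (deliberately simplified)."""
    if business_context:
        return entry["levels"]["humble"]
    if listener_age_delta > 0:      # listener is older than the speaker
        return entry["levels"]["polite"]
    return entry["levels"]["casual"]
```

A single formal/informal boolean cannot represent even this three-level simplification, let alone five levels with rules for age, status, business relationships, and region.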

2. Gender Bias

Most scraped corpora default to masculine grammatical forms. When a French dataset says "il est intelligent" in every example sentence, the AI system trained on it will reproduce that bias at scale.

Scraped result: 94% of German example sentences use masculine default forms. Curated result: Balanced distribution across masculine, feminine, and neutral constructions, with explicit gender metadata on every entry.
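Distribution claims like these are only auditable when gender is an explicit field on every entry. A minimal audit over hypothetical entries (the tag names are illustrative) looks like this:

```python
from collections import Counter

def gender_balance(entries):
    """Return the share of each gender tag among example sentences."""
    counts = Counter(e["gender"] for e in entries)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()}

examples = [
    {"text": "il est intelligent", "gender": "masculine"},
    {"text": "il est grand", "gender": "masculine"},
    {"text": "il est rapide", "gender": "masculine"},
    {"text": "elle est intelligente", "gender": "feminine"},
]
shares = gender_balance(examples)   # masculine-skewed: 0.75 vs. 0.25
```

Scraped corpora cannot run this check at all, because the gender metadata was never captured in the first place.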

3. Cultural Insensitivity

A word exists within networks of historical meaning, regional usage, and emotional connotation. A scraper sees text. A linguist sees meaning.

The Curated Alternative

Word Orb takes a fundamentally different approach. Every vocabulary entry across all 47 supported languages is built from source material selected and reviewed by linguists. The content pipeline is the subject of a pending U.S. patent (Application No. 18/088,519).

Source selection. Content originates from pedagogically appropriate material, not web crawls.

Linguistic review. Native speakers verify pronunciation, grammatical metadata, gender assignments, and usage context for every entry.

Gender equity filtering. Example sentences are systematically balanced across gender presentations.

AI identity awareness. Content is evaluated for appropriateness when delivered by an AI agent rather than a human teacher.
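The four steps above can be read as gates that every entry must clear before it ships. A toy sketch, where the field names and checks are hypothetical stand-ins rather than the patented pipeline:

```python
def passes_review(entry):
    """True only if an entry clears all four curation gates."""
    gates = [
        entry.get("source_type") == "pedagogical",      # source selection
        bool(entry.get("reviewed_by_native_speaker")),  # linguistic review
        entry.get("gender") is not None,                # gender equity metadata
        bool(entry.get("safe_for_ai_delivery")),        # AI identity awareness
    ]
    return all(gates)
```

The point of the all-gates structure is that a failure at any single step, such as a missing gender tag, blocks publication rather than degrading quality silently.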

Kelly, our AI teaching companion, is the quality standard. Every entry must be something Kelly could deliver to a learner of any age, gender, or cultural background without causing confusion or harm.

What This Means for Enterprise

If you are deploying language capabilities into healthcare, education, robotics, legal services, or international customer experience, the cost of curated data is a rounding error compared to the cost of a single public failure caused by bad language data.

Enterprise API access and custom language packages are available at wordorb.ai/enterprise.
