PHINEAS
AI teaching assistant using corpus linguistics to rewrite complex text at target reading levels — now in beta at an ESL academy in Singapore
Published:
Context
English learners and readers with accessibility needs struggle with text complexity. Existing “simplification” tools are crude — they swap long words for short ones without understanding usage frequency or semantic context. A word isn’t hard because it’s long; it’s hard because learners haven’t encountered it yet.
Approach
Built on the COCA (Corpus of Contemporary American English) word frequency database — 1 billion words of real English usage data:
- Corpus Architecture: Embedded COCA frequency database via OpenAI for semantic search across 60,000+ ranked vocabulary items
- Analysis Engine: Model identifies words above target frequency thresholds based on CEFR proficiency levels (A1–C2)
- Rewrite System: Intelligent substitution replaces complex vocabulary with accessible alternatives, preserving meaning and sentence structure
- SME Workflow: Fine-tuning pipeline designed for subject matter experts (ESL teachers) to contribute training examples without requiring technical skills — currently compiling 120-example training batch
Outcome
Currently deployed as a Google Gem (prompt + lexical database) in beta testing at a partner ESL academy in Singapore. Shared with select staff and students while training samples are compiled for the first batch fine-tuning run. Core analysis and rewrite functionality validated. Exceeds original project KPIs for accuracy.
Key Insight
The hard problem isn't the AI — it's the corpus data architecture. And making fine-tuning accessible to non-technical SMEs is a product design challenge, not an engineering challenge.
Portfolio Signal
- ◈ Domain expertise (education/linguistics — COCA corpus, CEFR standards)
- ◈ Embeddings implementation on real-world structured data
- ◈ Fine-tuning workflow design for non-technical contributors
- ◈ Phased delivery: working product in users' hands before optimization
- ◈ International deployment with real user feedback loop
Corporate Translation
Three skills in one project: (a) technical build with embeddings and fine-tuning, (b) PM discipline with phased delivery and measurable KPIs, (c) product management with a real beta program generating real user data.