Built in 25 Days
How a lean team shipped a multi-AI production tax engine — 10,700 lines of extraction logic, a 16-tool autonomous agent, and psychographic personalization from 1,885 real interactions — in under a month.
The Timeline
From first commit to production multi-AI system. Every week compounded on the last.
- Next.js App Router + Supabase auth scaffolding
- 12 core database tables with RLS policies
- Document upload infrastructure + viewer
- Organizer flow (taxpayer info collection)
- Referral system + OG dynamic preview cards
- 22-status admin pipeline with 6-bucket view
- Stripe billing integration (on-demand checkout sessions)
- Cal.com meeting sync cron (daily at 2 AM UTC)
- Mobile-first admin UI (3px color track, tap-to-expand)
- Growth analytics dashboard with weekly metrics
- Atlas AI assistant — SSE streaming + 16 tools + memory
- Azure Document Intelligence for W-2, 1040, 1099 family
- 3-tier extraction pipeline: Azure DI + Azure CU + Claude fallback
- 42+ IRS form types with custom normalizers (10,700 LOC)
- Voice agent (Vapi integration, conversational mode)
- 30 unit tests + 16 Playwright E2E tests shipped same week
- Psychographic intelligence: 5 emotional archetypes from 705 real calls
- Gemini Flash reasoning layer (12 specialized tax use cases)
- Langfuse prompt versioning + A/B testing (20/80 split)
- Confidence scoring per extracted field (0.0–1.0)
- Alert system: 8 risk flag types with severity tiers
- Post-filing dashboard + tax planning page + Learn tab
- 575 test cases across unit + E2E + Lighthouse CI
March 20, 5:26 PM — The system shifted from "Claude does everything" to a three-tier extraction architecture. Azure Document Intelligence for structure, Azure Content Understanding for semantics, Claude only as fallback. This single decision cut extraction cost by 50–70% and improved accuracy to 99%+ on standard forms.
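The tiering described above can be sketched as an ordered list of extractors tried cheapest-first until one clears a confidence threshold. This is a minimal sketch under stated assumptions: the `Extractor` interface, `runTiers`, the 0.9 threshold, and the stub tiers are all hypothetical, and the real Azure and Claude calls (which would be asynchronous) are replaced with synchronous stubs.

```typescript
// Hypothetical shapes for illustration; not the project's real interfaces.
interface ExtractionResult {
  fields: Record<string, string>;
  confidence: number; // 0.0 to 1.0
}

interface Extractor {
  name: string;
  // Returns null when this tier cannot process the document at all.
  extract(doc: string): ExtractionResult | null;
}

interface TieredOutcome {
  tier: string;
  result: ExtractionResult;
}

// Try tiers cheapest-first; stop at the first result that clears the threshold.
// If none clears it, return the most confident result seen.
function runTiers(doc: string, tiers: Extractor[], minConfidence: number): TieredOutcome | null {
  let best: TieredOutcome | null = null;
  for (const tier of tiers) {
    const result = tier.extract(doc);
    if (result === null) continue;
    if (best === null || result.confidence > best.result.confidence) {
      best = { tier: tier.name, result };
    }
    if (result.confidence >= minConfidence) break;
  }
  return best;
}

// Stub tiers standing in for Azure DI, Azure CU, and the Claude fallback.
const tiers: Extractor[] = [
  {
    name: "document-intelligence",
    extract: (doc) => (doc.indexOf("W-2") >= 0 ? { fields: { form: "W-2" }, confidence: 0.97 } : null),
  },
  { name: "content-understanding", extract: () => ({ fields: { form: "unknown" }, confidence: 0.6 }) },
  { name: "llm-fallback", extract: () => ({ fields: { form: "receipt" }, confidence: 0.92 }) },
];

const outcome = runTiers("W-2 wage statement", tiers, 0.9);
console.log(outcome ? outcome.tier : "no result");
```

Because the loop breaks as soon as a tier is confident enough, the most expensive tier only runs for documents the cheaper tiers could not handle, which is where a cost saving of this kind would come from.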
The Extraction Engine
10,700 lines of domain-specific tax logic — not a wrapper around an LLM. Custom normalizers for every IRS form type. Cross-document validation. Confidence scoring per field. This is the hardest piece to replicate.
Architecture
Form Coverage
| Document Family | Types Covered | Key Fields |
|---|---|---|
| W-2 | W-2, W-2c, W-4 | All 22 Box 12 codes (D=401k, AA=Roth, W=HSA, Z=409A, V=NSO), multi-state, locality |
| 1040 Family | 1040, 1040-SR, 1040-NR + Schedules A/B/C/D/E/SE/1/2/3 | 180+ field mapping, derived effective tax rate, AMT flag |
| 1099 Family | INT, DIV, NEC, MISC, R, B, G, K, SSA, SA, S, DA + 10 more | Transaction-level wash sale, RSU $0 cost basis detection, qualified vs ordinary dividends |
| 1098 Family | 1098, 1098-E, 1098-T | Mortgage interest, student loan, tuition + AOC eligibility gating |
| Other | K-1, 1095-A/C, paystub, ID docs, bank statements, receipts | YTD 401k tracking, over-contribution risk flag, employer match detection |
| Custom Analyzers | Form 8949, Form 8889, state returns | Gen AI-powered (trained CU analyzers), not generic OCR |
Normalizer Depth
| File | LOC | What It Does |
|---|---|---|
| azure-cu-normalizers.ts | 1,311 | W-2 + 1099 combo via Content Understanding field names |
| azure-normalizers-income.ts | 1,261 | 15 income form normalizers via Document Intelligence |
| azure-cu-normalizers-income.ts | 1,127 | 1099 family via CU (different field naming than DI) |
| azure-normalizers-1040.ts | 996 | 1040 main form + all schedules via DI |
| extractor.ts | 986 | Unified extraction orchestrator with retry + timeout strategy |
Box 12 code Z flags income from a failed Section 409A deferred compensation plan; missing it exposes the client to a 20% penalty plus tax on the deferred income. Most competitors map 5–8 Box 12 codes. We map all 22. RSU $0 cost basis flagging on 1099-B catches a double-tax risk that costs clients $3,000+. A Box 14 SDI/FLI regex handles OCR variants (CASDI, NJSDI, NYSDI) for state deductibility routing across CA/NJ/NY/WA.
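To make the Box 14 handling concrete, here is a minimal sketch of a label classifier. The pattern, the `classifyBox14` function, and the `Box14Match` shape are assumptions for illustration; the production regex is not shown in the source and likely covers more variants.

```typescript
// Hypothetical pattern: optional state prefix, optional separator, then SDI or FLI.
// [I1L] with the i flag tolerates the common OCR confusion of I, 1, and l.
const BOX14_PATTERN = /^\s*(CA|NJ|NY|WA)?[\s\-_.]*(SD[I1L]|FL[I1L])\b/i;

interface Box14Match {
  state: string | null; // null when the label carries no state prefix
  kind: "SDI" | "FLI";
}

function classifyBox14(label: string): Box14Match | null {
  const m = BOX14_PATTERN.exec(label);
  if (m === null) return null;
  const state = m[1] ? m[1].toUpperCase() : null;
  const raw = (m[2] || "").toUpperCase();
  return { state, kind: raw.charAt(0) === "S" ? "SDI" : "FLI" };
}

for (const label of ["CASDI", "NJ SDI", "ny-sdi", "CASD1", "FLI", "UNION DUES"]) {
  console.log(label, JSON.stringify(classifyBox14(label)));
}
```

Labels without a state prefix return `state: null`, so a caller could fall back to the W-2's Box 15 state before routing.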
Validation & Quality
20+ form-specific validation rules catch extraction errors before they reach the CPA. 8 cross-document checks catch inconsistencies across the full return.
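As an illustration of what a cross-document check can look like, the sketch below compares total W-2 Box 1 wages against Form 1040 line 1a. It is a hypothetical example and not necessarily one of the eight checks; the type names, `checkWagesMatch`, and the one-dollar tolerance are assumptions.

```typescript
// Hypothetical shapes; field names are illustrative, not the project's schema.
interface W2Wages {
  employer: string;
  box1Wages: number;
}

interface Form1040Wages {
  line1aWages: number;
}

interface CheckResult {
  ok: boolean;
  message: string;
}

// Form 1040 line 1a should equal the sum of Box 1 across all W-2s.
// A small tolerance absorbs rounding to whole dollars.
function checkWagesMatch(w2s: W2Wages[], f1040: Form1040Wages, toleranceDollars: number): CheckResult {
  const total = w2s.reduce((sum, w2) => sum + w2.box1Wages, 0);
  const diff = Math.abs(total - f1040.line1aWages);
  if (diff <= toleranceDollars) {
    return { ok: true, message: "W-2 wages match Form 1040 line 1a" };
  }
  return {
    ok: false,
    message: `W-2 Box 1 totals ${total} but Form 1040 line 1a shows ${f1040.line1aWages}`,
  };
}

const wageCheck = checkWagesMatch(
  [
    { employer: "Employer A", box1Wages: 85000 },
    { employer: "Employer B", box1Wages: 40000 },
  ],
  { line1aWages: 85000 },
  1,
);
console.log(wageCheck.message);
```

A mismatch like the one above usually means either a W-2 was missed on the return or a field was misread during extraction, which is why the check runs before the CPA sees the data.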
Form-Level Checks
Cross-Document Checks
Atlas AI Agent
A 16-tool autonomous agent that fills organizer forms, books meetings, tracks IRS refunds, submits revision requests, and approves returns for e-filing. It acts on behalf of the user — it doesn't just talk.
Tool Ecosystem
| Tool | What It Does | Writes to DB |
|---|---|---|
| fill_organizer_field | Saves to 4 tables (profiles, tax_profiles, addresses, income_info) | Yes |
| approve_draft | Marks return approved, notifies expert, triggers e-file pipeline | Yes |
| show_booking | Embeds Cal.com widget inline, updates return status | Yes |
| submit_revision_request | Records revision, updates pipeline status | Yes |
| fill_questionnaire_events | Batch yes/no for life events + financial events | Yes |
| get_tax_estimate | Real-time federal tax calculation from extracted data | Read |
| track_irs_refund | Triggers background IRS refund check | Yes |
| suggest_upload_category | Opens file picker for specific document type | Read |
| navigate_organizer_section | Moves sidebar to specific section (11 options) | Read |
| suggest_replies | Renders quick-reply chips in chat UI | Read |
| list_uploaded_documents | Current documents with extraction status | Read |
| get_missing_documents | Personalized missing doc list by income type | Read |
| check_return_status | Current pipeline stage + task completion | Read |
| get_organizer_status | Fields filled vs missing across all sections | Read |
| trigger_external_workflow | Starts background async job (IRS checks) | Yes |
| contact_expert | Escalation to human CPA/EA | Yes |
Context Assembly
Before every response, Atlas loads 8 context layers in parallel via Promise.all(). Total assembly time: ~200–300ms.
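The parallel load can be sketched roughly as follows. The layer names, `fakeLookup`, and `assembleContext` are hypothetical (the source does not list the eight layers), and timers stand in for real database queries.

```typescript
// Hypothetical layer names; the source does not list the eight layers.
interface ContextLayer {
  name: string;
  load: () => Promise<string>;
}

// Simulates an I/O-bound lookup such as a database query.
function fakeLookup(value: string, ms: number): Promise<string> {
  return new Promise<string>((resolve) => {
    setTimeout(() => resolve(value), ms);
  });
}

const layers: ContextLayer[] = [
  { name: "profile", load: () => fakeLookup("filing status: single", 30) },
  { name: "documents", load: () => fakeLookup("2 W-2s, 1 1099-B", 20) },
  { name: "taxRules", load: () => fakeLookup("40 rules loaded", 10) },
];

// All loads start at once, so total latency tracks the slowest layer
// rather than the sum of all layers.
async function assembleContext(all: ContextLayer[]): Promise<Record<string, string>> {
  const pairs = await Promise.all(
    all.map(async (layer) => ({ name: layer.name, value: await layer.load() })),
  );
  const context: Record<string, string> = {};
  for (const pair of pairs) {
    context[pair.name] = pair.value;
  }
  return context;
}

assembleContext(layers).then((context) => {
  console.log(Object.keys(context).join(", "));
});
```

One design consequence worth noting: `Promise.all` rejects as soon as any single load fails, so a real implementation would need to decide whether a missing layer should abort the response or degrade gracefully.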
Streaming & Performance
SSE streaming opens the HTTP connection ~400ms before context assembly completes — the user sees the typing indicator immediately. Bidirectional sync: form fields update live as the user talks. Ephemeral prompt caching (5-minute TTL) saves ~80% on input tokens for rapid-fire conversations.
Every dollar amount Atlas cites must come from database context, never from model weights. The 40 IRS tax rules are loaded as ground truth — Atlas never guesses a deduction limit or bracket threshold. This is how you build trust with anxious immigrants handling $200K+ in W-2 income.
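One way such a grounding rule could be enforced is a post-generation check that every dollar figure in a draft reply appears in the loaded context. The source does not say how the rule is enforced, so this is only a sketch; `extractDollarAmounts` and `findUngroundedAmounts` are hypothetical helpers.

```typescript
// Pulls every "$1,234.56"-style figure out of a string as a plain number.
function extractDollarAmounts(text: string): number[] {
  const matches: string[] = text.match(/\$\d[\d,]*(?:\.\d+)?/g) || [];
  return matches.map((m) => Number(m.replace(/[$,]/g, "")));
}

// Returns the amounts in the draft that do not appear in the grounded set.
function findUngroundedAmounts(draft: string, grounded: number[]): number[] {
  return extractDollarAmounts(draft).filter((amount) => grounded.indexOf(amount) === -1);
}

// Amounts assumed to come from database context for this conversation.
const groundedAmounts = [23500, 25000];
const draftReply = "The limit is $23,500, you deferred $25,000, so about $7,000 is at risk.";

console.log(findUngroundedAmounts(draftReply, groundedAmounts));
```

Any amount returned by the check is a figure the model produced without a matching database value, and the reply could be blocked or regenerated before it reaches the user.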
The Intelligence Layer
Psychographic Intelligence
5 emotional archetypes detected from 705 real client calls (Fireflies + RingCentral), not personas invented in a workshop. Atlas adapts tone, urgency, and detail level per archetype. Refinement runs every 5 messages based on emotional signals.
Alert System
8 risk flag types auto-detected from extracted data. Severity tiers prevent alert fatigue. Anti-contradiction rules ensure Atlas never contradicts CPA advice.
| Alert | Detection Logic | Why It Matters |
|---|---|---|
| 401(k) Over-Contribution | Sum Box 12 code D across all W-2s > $23,500 | Excess deferral is taxed twice if not withdrawn by April 15 |
| HSA Over-Contribution | Sum Box 12 code W vs family/individual limit | 6% excise tax + taxable income |
| RSU $0 Cost Basis | 1099-B transactions with cost_basis = 0 | Double-tax risk: $3,000+ per vesting event |
| Wash Sale Accumulation | Sum wash_sale_loss_disallowed across all 1099-Bs | Overstated loss = IRS audit trigger |
| FBAR / FATCA Required | Foreign account indicators in extraction | $10,000+ penalty per unreported account |
| Multi-State Filing | Multiple state entries across W-2s | Incorrect allocation = state audit |
| Backdoor Roth Opportunity | AGI > $161K + no IRA distribution | Missing $1,500+ annual tax savings |
| Underpayment Penalty Risk | Withheld < 90% of estimated liability | Avoid penalty via Q4 estimated payment |
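The first row of the table, for example, reduces to a sum over Box 12 entries. The sketch below assumes hypothetical `W2Doc` and `Box12Entry` shapes and takes the limit as a parameter, since it changes by tax year.

```typescript
// Hypothetical shapes; field names are illustrative, not the project's schema.
interface Box12Entry {
  code: string;
  amount: number;
}

interface W2Doc {
  employer: string;
  box12: Box12Entry[];
}

interface OverContributionAlert {
  type: "401k_over_contribution";
  total: number;
  excess: number;
}

// Sums Box 12 code D across every W-2 and compares against the elective deferral limit.
function detect401kOverContribution(w2s: W2Doc[], limit: number): OverContributionAlert | null {
  let total = 0;
  for (const w2 of w2s) {
    for (const entry of w2.box12) {
      if (entry.code === "D") total += entry.amount;
    }
  }
  if (total <= limit) return null;
  return { type: "401k_over_contribution", total, excess: total - limit };
}

const alert401k = detect401kOverContribution(
  [
    { employer: "Employer A", box12: [{ code: "D", amount: 15000 }, { code: "W", amount: 3000 }] },
    { employer: "Employer B", box12: [{ code: "D", amount: 12000 }] },
  ],
  23500,
);
console.log(alert401k);
```

In this example neither employer exceeds the limit alone, so neither payroll system would catch it; only the sum across both W-2s reveals the excess.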
Prompt Management
Langfuse Versioning
Emotional Scoring
Production Infrastructure
Integrations
| Service | Purpose | Status |
|---|---|---|
| Stripe | Dynamic checkout sessions, webhook payment confirmation, split invoicing | Production |
| Cal.com | Meeting scheduling, daily cron sync, auto-status (confirmed/completed/cancelled) | Production |
| Resend | Transactional email (welcome, referral, expert notify), archetype-aware copy | Production |
| Langfuse | Prompt versioning, A/B testing, conversation tracing, emotional scoring | Production |
| Upstash Redis | Rate limiting across serverless instances (5/min, 80/hr, 200/day) | Production |
| Sentry | Error tracking + performance monitoring + release health | Production |
Security
Data Protection
Rate Limiting
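The three limits listed for Upstash Redis under Integrations (5/min, 80/hr, 200/day) can be sketched as a multi-window limiter. This is an in-memory stand-in for illustration only: `createLimiter` and `RateWindow` are hypothetical names, and a serverless deployment would need the shared Redis store rather than process memory.

```typescript
// Hypothetical in-memory stand-in for a Redis-backed limiter.
interface RateWindow {
  limit: number;
  ms: number;
}

const RATE_WINDOWS: RateWindow[] = [
  { limit: 5, ms: 60 * 1000 },
  { limit: 80, ms: 60 * 60 * 1000 },
  { limit: 200, ms: 24 * 60 * 60 * 1000 },
];

function createLimiter(windows: RateWindow[]) {
  const hits: Record<string, number[]> = {};
  const longest = windows.reduce((max, w) => Math.max(max, w.ms), 0);

  // Returns true and records the request only if every window has room.
  return (userId: string, now: number): boolean => {
    // Drop timestamps older than the longest window so memory stays bounded.
    const recent = (hits[userId] || []).filter((t) => now - t < longest);
    hits[userId] = recent;
    for (const w of windows) {
      const used = recent.filter((t) => now - t < w.ms).length;
      if (used >= w.limit) return false;
    }
    recent.push(now);
    return true;
  };
}

const allow = createLimiter(RATE_WINDOWS);
const t0 = 1000000;
for (let i = 0; i < 6; i++) {
  // The sixth call inside the same minute is rejected.
  console.log(i + 1, allow("user-1", t0 + i));
}
```

Passing `now` as an argument rather than reading the clock inside the limiter keeps the behavior deterministic, which makes the window boundaries easy to test.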
Cost Efficiency
Prompt Caching
Model Selection
Testing
575 test cases across Vitest unit tests, Playwright E2E scenarios, and Lighthouse CI. Seeded test clients (w2-simple, freelancer, missing-docs) for repeatable E2E runs. GitHub Actions CI on every push.
The Multi-AI Orchestra
"The secret sauce isn't one AI model — it's the orchestration. Six specialized engines, each playing its role in a carefully choreographed system."
Each service is chosen for what it does best. Azure for structured document understanding. Claude for nuanced conversation. Gemini for fast reasoning. ElevenLabs for natural voice. Langfuse for iteration speed. The orchestration layer is the moat — not any single model.
What's Next
Everything built so far — the extraction engine, the agent architecture, the psychographic layer — is a horizontal capability. Tax was the proving ground. The intelligence layer is the product.
"We built this in 25 days with a lean team. Imagine what happens with 12 months and a platform thesis."