Volume 4 · Engineering Intelligence

Built in 25 Days

How a lean team shipped a multi-AI production tax engine — 10,700 lines of extraction logic, a 16-tool autonomous agent, and psychographic personalization from 1,885 real interactions — in under a month.

10.7K

Lines of Extraction Logic

42+

IRS Form Types

16

AI Agent Tools

575

Test Cases

The Timeline

From first commit to production multi-AI system. Every week compounded on the last.

Week 1 — Foundation

March 1–7, 2026

Next.js App Router + Supabase auth scaffolding
12 core database tables with RLS policies
Document upload infrastructure + viewer
Organizer flow (taxpayer info collection)
Referral system + OG dynamic preview cards

Week 2 — Pipeline

March 8–14, 2026

22-status admin pipeline with 6-bucket view
Stripe billing integration (on-demand checkout sessions)
Cal.com meeting sync cron (daily at 2 AM UTC)
Mobile-first admin UI (3px color track, tap-to-expand)
Growth analytics dashboard with weekly metrics

Week 3 — Intelligence

March 15–21, 2026

Atlas AI assistant — SSE streaming + 16 tools + memory
Azure Document Intelligence for W-2, 1040, 1099 family
3-tier extraction pipeline: Azure DI + Azure CU + Claude fallback
42+ IRS form types with custom normalizers (10,700 LOC)
Voice agent (Vapi integration, conversational mode)
30 unit tests + 16 Playwright E2E tests shipped same week

Week 4 — Orchestration

March 22–26, 2026

Psychographic intelligence: 5 emotional archetypes from 705 real calls
Gemini Flash reasoning layer (12 specialized tax use cases)
Langfuse prompt versioning + A/B testing (20/80 split)
Confidence scoring per extracted field (0.0–1.0)
Alert system: 8 risk flag types with severity tiers
Post-filing dashboard + tax planning page + Learn tab
575 test cases across unit + E2E + Lighthouse CI

The pivotal moment

March 20, 5:26 PM — The system shifted from "Claude does everything" to a three-tier extraction architecture. Azure Document Intelligence for structure, Azure Content Understanding for semantics, Claude only as fallback. This single decision cut extraction cost by 50–70% and improved accuracy to 99%+ on standard forms.

The Extraction Engine

Extraction Pipeline

The technical core

10,700 lines of domain-specific tax logic — not a wrapper around an LLM. Custom normalizers for every IRS form type. Cross-document validation. Confidence scoring per field. This is the hardest piece to replicate.

Architecture

Document Upload

↓

Azure DI (structured extraction) + Azure CU (semantic understanding)

Run in parallel · 45–90s timeout · Winner takes all

↓

20+ Validation Rules + 8 cross-document checks

↓

Claude Fallback (only if Azure returns empty)

↓

return_intelligence (60+ aggregated fields per return)
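The parallel race and fallback in the diagram can be sketched roughly as follows. This is a minimal illustration, not the production `extractor.ts`: the `Extractor` type, the `runDI`, `runCU`, and `runClaude` parameters, and the reading of "winner takes all" as "first non-empty result wins" are all assumptions.

```typescript
// Hypothetical types and names; the real extractor is not shown in this document.
type Extraction = { source: string; fields: Record<string, string> };
type Extractor = (doc: Uint8Array) => Promise<Extraction | null>;

// Reject if the wrapped promise does not settle within `ms` milliseconds.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    p.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Treat a null or field-less result as a failure so the race skips it.
async function nonEmpty(p: Promise<Extraction | null>): Promise<Extraction> {
  const result = await p;
  if (!result || Object.keys(result.fields).length === 0) {
    throw new Error("empty extraction");
  }
  return result;
}

// Resolve with the first promise that succeeds; reject only if all of them fail.
function firstSuccess<T>(promises: Promise<T>[]): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    let failures = 0;
    for (const p of promises) {
      p.then(resolve, () => {
        failures += 1;
        if (failures === promises.length) reject(new Error("all extractors failed"));
      });
    }
  });
}

async function extractDocument(
  doc: Uint8Array,
  runDI: Extractor,
  runCU: Extractor,
  runClaude: Extractor,
  timeoutMs: number = 45000,
): Promise<Extraction | null> {
  try {
    // Both Azure tiers start together; the first non-empty result wins.
    return await firstSuccess([
      nonEmpty(withTimeout(runDI(doc), timeoutMs)),
      nonEmpty(withTimeout(runCU(doc), timeoutMs)),
    ]);
  } catch {
    // Both Azure tiers failed, timed out, or returned nothing: fall back to Claude.
    return runClaude(doc);
  }
}
```

Passing the extractors in as parameters keeps the orchestration testable without any cloud credentials.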

Form Coverage

42+

IRS Form Types

22

W-2 Box 12 Codes

80+

Brokerage Registry

Typed Error Codes

Document Family	Types Covered	Key Fields
W-2	W-2, W-2c, W-4	All 22 Box 12 codes (D=401k, AA=Roth, W=HSA, Z=409A, V=NSO), multi-state, locality
1040 Family	1040, 1040-SR, 1040-NR + Schedules A/B/C/D/E/SE/1/2/3	180+ field mapping, derived effective tax rate, AMT flag
1099 Family	INT, DIV, NEC, MISC, R, B, G, K, SSA, SA, S, DA + 10 more	Transaction-level wash sale, RSU $0 cost basis detection, qualified vs ordinary dividends
1098 Family	1098, 1098-E, 1098-T	Mortgage interest, student loan, tuition + AOC eligibility gating
Other	K-1, 1095-A/C, paystub, ID docs, bank statements, receipts	YTD 401k tracking, over-contribution risk flag, employer match detection
Custom Analyzers	Form 8949, Form 8889, state returns	Gen AI-powered (trained CU analyzers), not generic OCR

Normalizer Depth

File	LOC	What It Does
`azure-cu-normalizers.ts`	1,311	W-2 + 1099 combo via Content Understanding field names
`azure-normalizers-income.ts`	1,261	15 income form normalizers via Document Intelligence
`azure-cu-normalizers-income.ts`	1,127	1099 family via CU (different field naming than DI)
`azure-normalizers-1040.ts`	996	1040 main form + all schedules via DI
`extractor.ts`	986	Unified extraction orchestrator with retry + timeout strategy

Why this is hard to replicate

Box 12 code Z flags income from a deferred compensation plan that failed Section 409A. That income carries a 20% additional tax on top of regular income tax, so a missed code Z understates the liability. Most competitors map 5–8 Box 12 codes. We map all 22. RSU $0 cost basis flagging on 1099-B catches a double-tax risk that costs clients $3,000+. The Box 14 SDI/FLI regex handles OCR variants (CASDI, NJSDI, NYSDI) so state deductibility routes correctly across CA/NJ/NY/WA.

Validation & Quality

20+ form-specific validation rules catch extraction errors before they reach the CPA. 8 cross-document checks catch inconsistencies across the full return.

Form-Level Checks

W-2: fed withheld ≤ wages, SS wages within 20% of total. 1099-B: proceeds ≥ 0, cost basis sanity. 1098: mortgage interest ≤ outstanding balance. Any field with confidence < 0.7 flagged for CPA review.
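A minimal sketch of the W-2 checks listed above, assuming a hypothetical field shape (`value` plus `confidence`) and reading "within 20% of total" as a relative difference from Box 1 wages. Neither the names nor that reading come from the production code.

```typescript
// Hypothetical field shape; names are illustrative, not the production schema.
type Field = { value: number; confidence: number };
type W2 = { wages: Field; fedWithheld: Field; ssWages: Field };

const REVIEW_THRESHOLD = 0.7; // fields below this confidence go to CPA review

function validateW2(w2: W2): string[] {
  const issues: string[] = [];

  // Federal withholding can never exceed the wages it was withheld from.
  if (w2.fedWithheld.value > w2.wages.value) {
    issues.push("fed_withheld_exceeds_wages");
  }

  // Assumed reading of "within 20%": relative difference from Box 1 wages.
  if (
    w2.wages.value > 0 &&
    Math.abs(w2.ssWages.value - w2.wages.value) / w2.wages.value > 0.2
  ) {
    issues.push("ss_wages_out_of_range");
  }

  // Any low-confidence field is flagged regardless of its value.
  for (const name of Object.keys(w2) as (keyof W2)[]) {
    if (w2[name].confidence < REVIEW_THRESHOLD) {
      issues.push(`low_confidence:${name}`);
    }
  }
  return issues;
}
```

Returning issue codes rather than throwing lets a reviewer see every problem on a form in a single pass.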

Cross-Document Checks

Paystub YTD ≈ W-2 wages. Duplicate 1099-INT from same payer. Duplicate 1099-B from same brokerage. Withholding rate sanity (5–40%). W-2 wages vs prior 1040 AGI consistency.
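Two of these cross-document checks could look something like the sketch below. The types are hypothetical, and the duplicate rule (same payer plus same amount) is an assumed matching heuristic, since the document does not say how duplicates are identified.

```typescript
// Hypothetical summaries of already-extracted documents.
type W2Summary = { wages: number; fedWithheld: number };
type Form1099Int = { payerTin: string; interest: number };

// Flag a combined withholding rate outside the 5–40% band described above.
function withholdingRateFlag(w2s: W2Summary[]): string | null {
  const wages = w2s.reduce((sum, w) => sum + w.wages, 0);
  const withheld = w2s.reduce((sum, w) => sum + w.fedWithheld, 0);
  if (wages <= 0) return null; // nothing to compare against
  const rate = withheld / wages;
  return rate < 0.05 || rate > 0.4
    ? `withholding_rate_out_of_range:${rate.toFixed(3)}`
    : null;
}

// Assumed rule: same payer and same amount means a likely duplicate upload.
function duplicate1099Ints(forms: Form1099Int[]): Form1099Int[] {
  const seen = new Set<string>();
  const dupes: Form1099Int[] = [];
  for (const form of forms) {
    const key = `${form.payerTin}|${form.interest.toFixed(2)}`;
    if (seen.has(key)) {
      dupes.push(form);
    } else {
      seen.add(key);
    }
  }
  return dupes;
}
```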

Atlas AI Agent

Not a chatbot

A 16-tool autonomous agent that fills organizer forms, books meetings, tracks IRS refunds, submits revision requests, and approves returns for e-filing. It acts on behalf of the user — it doesn't just talk.

Tool Ecosystem

Tool	What It Does	DB Access
`fill_organizer_field`	Saves to 4 tables (profiles, tax_profiles, addresses, income_info)	Write
`approve_draft`	Marks return approved, notifies expert, triggers e-file pipeline	Write
`show_booking`	Embeds Cal.com widget inline, updates return status	Write
`submit_revision_request`	Records revision, updates pipeline status	Write
`fill_questionnaire_events`	Batch yes/no for life events + financial events	Write
`get_tax_estimate`	Real-time federal tax calculation from extracted data	Read
`track_irs_refund`	Triggers background IRS refund check	Write
`suggest_upload_category`	Opens file picker for specific document type	Read
`navigate_organizer_section`	Moves sidebar to specific section (11 options)	Read
`suggest_replies`	Renders quick-reply chips in chat UI	Read
`list_uploaded_documents`	Current documents with extraction status	Read
`get_missing_documents`	Personalized missing doc list by income type	Read
`check_return_status`	Current pipeline stage + task completion	Read
`get_organizer_status`	Fields filled vs missing across all sections	Read
`trigger_external_workflow`	Starts background async job (IRS checks)	Write
`contact_expert`	Escalation to human CPA/EA	Write

Context Assembly

Before every response, Atlas loads 8 context layers in parallel via Promise.all(). Total assembly time: ~200–300ms.

buildAgentContext() — 8 Parallel DB Loaders

Return Intelligence

60+ aggregated fields

Conversation Memory

Top 3 by importance

Personality Profile

analyst / driver / amiable / expressive

Pending Doc Requests

Top 5 by dollar impact

Tax Rules

40 IRS limits (anti-hallucination)

What-If Scenarios

Actionable savings ranked by delta

Journey State

9 sub-queries, next action

Emotional Archetype

Detected from calls + behavior
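The parallel loading pattern behind `buildAgentContext()` might be sketched as follows. Only four of the eight layers are shown, and the loader signatures and context shape are assumptions made for illustration.

```typescript
// Hypothetical loader signature; each loader stands in for one DB query.
type Loader<T> = (userId: string) => Promise<T>;

type AgentContext = {
  returnIntelligence: Record<string, unknown>;
  memories: string[];
  personality: string;
  taxRules: Record<string, number>;
};

type Loaders = { [K in keyof AgentContext]: Loader<AgentContext[K]> };

async function buildAgentContext(userId: string, loaders: Loaders): Promise<AgentContext> {
  // All loaders start at once, so latency tracks the slowest query, not the sum.
  const [returnIntelligence, memories, personality, taxRules] = await Promise.all([
    loaders.returnIntelligence(userId),
    loaders.memories(userId),
    loaders.personality(userId),
    loaders.taxRules(userId),
  ]);
  // "Top 3 by importance": assumes the loader returns memories already sorted.
  return { returnIntelligence, memories: memories.slice(0, 3), personality, taxRules };
}
```

Because `Promise.all` rejects as soon as any loader fails, a production version would likely wrap optional layers so one slow or failing query cannot block the whole response.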

Streaming & Performance

~400ms

Stream opens before context loads

80%

Cheaper on prompt cache hit

Max iterations per turn

SSE streaming opens the HTTP connection ~400ms before context assembly completes — the user sees the typing indicator immediately. Bidirectional sync: form fields update live as the user talks. Ephemeral prompt caching (5-minute TTL) saves ~80% on input tokens for rapid-fire conversations.
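The "stream first, load context second" ordering can be illustrated with an async generator that yields SSE frames. This is a sketch under assumptions: `atlasStream`, `loadContext`, and `generate` are hypothetical names, and the event names are invented for the example.

```typescript
// Yields SSE frames. The first frame goes out before the context load finishes.
async function* atlasStream(
  loadContext: () => Promise<string>,
  generate: (context: string) => AsyncIterable<string>,
): AsyncGenerator<string> {
  const contextPromise = loadContext(); // start loading, do not await yet
  yield "event: typing\ndata: {}\n\n"; // client can show the typing indicator now

  const context = await contextPromise; // only now wait for the DB loaders
  for await (const token of generate(context)) {
    yield `data: ${JSON.stringify({ token })}\n\n`;
  }
  yield "event: done\ndata: {}\n\n";
}
```

In a route handler, each yielded frame would be written to the response body as it arrives, so the perceived latency is the time to the first frame rather than the time to assemble context.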

Determinism rule

Every dollar amount Atlas cites must come from database context, never from model weights. The 40 IRS tax rules are loaded as ground truth — Atlas never guesses a deduction limit or bracket threshold. This is how you build trust with anxious immigrants handling $200K+ in W-2 income.
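The document does not say how the determinism rule is enforced. One way to check it, offered purely as an assumption, is a post-generation guard that extracts every dollar figure from a draft reply and rejects any that does not appear in the loaded context. The function names are hypothetical.

```typescript
// Pull dollar figures such as "$23,500" or "$1,234.50" out of a draft reply.
function dollarAmounts(text: string): number[] {
  const matches = text.match(/\$\d[\d,]*(?:\.\d+)?/g) ?? [];
  return matches.map((m) => Number(m.replace(/[$,]/g, "")));
}

// Return every cited amount that is not present in the database context.
function ungroundedAmounts(reply: string, contextValues: number[]): number[] {
  const allowed = new Set(contextValues);
  return dollarAmounts(reply).filter((amount) => !allowed.has(amount));
}
```

A non-empty result would signal that the model produced a number from its own weights, and the reply could be regenerated or escalated instead of sent.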

The Intelligence Layer

Psychographic Intelligence

5 emotional archetypes detected from 705 real client calls (Fireflies + RingCentral), not personas invented in a workshop. Atlas adapts tone, urgency, and detail level per archetype. Refinement runs every 5 messages based on emotional signals.

Share	Archetype	How Atlas Adapts
52%	Anxious Immigrant	Reassure before inform. Avoid scary words (audit, penalty). "This is very common."
~20%	Optimizer	Show the math. Precise numbers. Asks "why"; answer with calculations.
~12%	Overwhelmed	Bottom line first. Skip jargon. One thing at a time. Don't stack questions.
~8%	Price Shopper	Lead with value, not features. Compare to alternatives. Quantify savings.
~8%	Relationship Seeker	Mirror warmth. "I've got you." Long-term loyalty over quick wins.

Alert System

8 risk flag types auto-detected from extracted data. Severity tiers prevent alert fatigue. Anti-contradiction rules ensure Atlas never contradicts CPA advice.

Alert	Detection Logic	Why It Matters
401(k) Over-Contribution	Sum Box 12 code D across all W-2s > $23,500	Excess is taxed twice if not withdrawn by the April 15 deadline
HSA Over-Contribution	Sum Box 12 code W vs family/individual limit	6% excise tax + taxable income
RSU $0 Cost Basis	1099-B transactions with cost_basis = 0	Double-tax risk: $3,000+ per vesting event
Wash Sale Accumulation	Sum wash_sale_loss_disallowed across all 1099-Bs	Overstated loss = IRS audit trigger
FBAR / FATCA Required	Foreign account indicators in extraction	$10,000+ penalty per unreported account
Multi-State Filing	Multiple state entries across W-2s	Incorrect allocation = state audit
Backdoor Roth Opportunity	AGI > $161K + no IRA distribution	Missing $1,500+ annual tax savings
Underpayment Penalty Risk	Withheld < 90% of estimated liability	Avoid penalty via Q4 estimated payment
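Two rows of the table, the 401(k) over-contribution and the RSU $0 cost basis flags, can be sketched as a single pass over extracted data. The types and alert names here are hypothetical, and the extra `proceeds > 0` condition is an assumption added to skip empty rows.

```typescript
// Hypothetical shapes for extracted documents.
type Box12Entry = { code: string; amount: number };
type W2Doc = { employer: string; box12: Box12Entry[] };
type Sale = { description: string; proceeds: number; costBasis: number };
type Alert = { type: string; detail: string };

const DEFERRAL_LIMIT = 23500; // limit cited in the table above

function detectAlerts(w2s: W2Doc[], sales: Sale[]): Alert[] {
  const alerts: Alert[] = [];

  // Sum code D across every W-2, since the limit applies per person, not per employer.
  let deferrals = 0;
  for (const w2 of w2s) {
    for (const entry of w2.box12) {
      if (entry.code === "D") deferrals += entry.amount;
    }
  }
  if (deferrals > DEFERRAL_LIMIT) {
    alerts.push({
      type: "401k_over_contribution",
      detail: `excess=${deferrals - DEFERRAL_LIMIT}`,
    });
  }

  // A sale with proceeds but a reported basis of exactly 0 is the RSU double-tax pattern.
  for (const sale of sales) {
    if (sale.costBasis === 0 && sale.proceeds > 0) {
      alerts.push({ type: "rsu_zero_cost_basis", detail: sale.description });
    }
  }
  return alerts;
}
```

The per-person sum is the important detail: each employer can be individually under the limit while the combined deferrals are over it.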

Prompt Management

Langfuse Versioning

Edit Atlas behavior without deploying code. 5 archetype prompt variants, emotional framework, curiosity mode. A/B testing with 20% experiment / 80% control via user ID hash.
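Hash-based bucketing keeps each user in the same variant across sessions without storing an assignment. The sketch below uses FNV-1a as a stand-in, since the actual hash function is not specified, and `assignVariant` is a hypothetical name.

```typescript
// FNV-1a 32-bit hash over UTF-16 code units: deterministic and dependency-free.
// The production hash is not specified; this is a stand-in.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value an unsigned 32-bit int
  }
  return hash;
}

// Users whose hash falls in the first `experimentPercent` of 100 buckets get the experiment.
function assignVariant(
  userId: string,
  experimentPercent: number = 20,
): "experiment" | "control" {
  return fnv1a(userId) % 100 < experimentPercent ? "experiment" : "control";
}
```

Because the assignment is a pure function of the user ID, changing the split from 20% to, say, 30% only moves users from control into the experiment and never the other way.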

Emotional Scoring

Every response scored for engagement + emotional shift. Confident (+2), relieved (+1), neutral (0), confused (-1), frustrated (-2), anxious (-1). Tracked in Langfuse traces.
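The score table above translates directly into a lookup. In this sketch, "emotional shift" is assumed to mean the score change between the user's state before and after a reply; the document does not define it more precisely.

```typescript
// Scores copied from the list above.
const EMOTION_SCORES = {
  confident: 2,
  relieved: 1,
  neutral: 0,
  confused: -1,
  anxious: -1,
  frustrated: -2,
} as const;

type Emotion = keyof typeof EMOTION_SCORES;

// Assumed definition: positive means the reply moved the user toward confidence.
function emotionalShift(before: Emotion, after: Emotion): number {
  return EMOTION_SCORES[after] - EMOTION_SCORES[before];
}
```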

Production Infrastructure

169

API Routes

692

TypeScript Files

575

Test Cases

59

Database Tables

Integrations

Service	Purpose	Status
Stripe	Dynamic checkout sessions, webhook payment confirmation, split invoicing	Production
Cal.com	Meeting scheduling, daily cron sync, auto-status (confirmed/completed/cancelled)	Production
Resend	Transactional email (welcome, referral, expert notify), archetype-aware copy	Production
Langfuse	Prompt versioning, A/B testing, conversation tracing, emotional scoring	Production
Upstash Redis	Rate limiting across serverless instances (5/min, 80/hr, 200/day)	Production
Sentry	Error tracking + performance monitoring + release health	Production

Security

Data Protection

RLS on all 59 tables. SSN encrypted with AES-256-GCM. UUID validation on all route params. 10 MB file upload limit enforced server-side. HSTS + CSP + X-Frame-Options: DENY.
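For reference, field-level AES-256-GCM encryption with Node's built-in `crypto` module looks roughly like this. The storage format (IV, auth tag, and ciphertext joined with dots) and the function names are assumptions, and key management is out of scope for the sketch.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Encrypts with a fresh 96-bit IV per value; `key` must be 32 bytes for AES-256.
function encryptSsn(ssn: string, key: Buffer): string {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(ssn, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag(); // authenticates the ciphertext
  // Hypothetical storage format: base64 parts joined by dots.
  return [iv, tag, ciphertext].map((part) => part.toString("base64")).join(".");
}

// Throws if the key is wrong or the stored value was tampered with.
function decryptSsn(payload: string, key: Buffer): string {
  const [iv, tag, ciphertext] = payload.split(".").map((part) => Buffer.from(part, "base64"));
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

GCM's authentication tag is what makes tampering detectable: decryption fails outright instead of returning corrupted plaintext.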

Rate Limiting

Upstash Redis sliding window: 5 msg/min burst, 80 msg/hr sustained, 200 msg/day cap. Fail-open if Redis unreachable (never block users). Exponential backoff with Retry-After header respect.
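The three-tier limit and the fail-open behavior can be illustrated with an in-memory sliding log. This is not the Upstash implementation: it swaps Redis for a `Map` so the logic is visible, and the class and function names are hypothetical.

```typescript
type RateWindow = { limit: number; ms: number };

const MINUTE_MS = 60000;
const HOUR_MS = 3600000;
const DAY_MS = 86400000;

// Limits from the paragraph above: 5/min burst, 80/hr sustained, 200/day cap.
const WINDOWS: RateWindow[] = [
  { limit: 5, ms: MINUTE_MS },
  { limit: 80, ms: HOUR_MS },
  { limit: 200, ms: DAY_MS },
];

class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();

  // Returns true and records the hit if the user is under every limit.
  allow(userId: string, now: number = Date.now()): boolean {
    // Drop timestamps older than the longest window.
    const recent = (this.hits.get(userId) ?? []).filter((t) => now - t < DAY_MS);
    for (const w of WINDOWS) {
      const count = recent.filter((t) => now - t < w.ms).length;
      if (count >= w.limit) {
        this.hits.set(userId, recent);
        return false;
      }
    }
    recent.push(now);
    this.hits.set(userId, recent);
    return true;
  }
}

// Fail open: if the limiter backend throws (e.g. Redis unreachable), let the request through.
function allowFailOpen(check: () => boolean): boolean {
  try {
    return check();
  } catch {
    return true;
  }
}
```

Failing open trades abuse protection for availability during an outage, which matches the stated priority of never blocking users.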

Cost Efficiency

Prompt Caching

Ephemeral 5-min TTL on system prompt + tools array. ~80% cheaper input tokens, ~40% faster TTFT on cache hit. Estimated ~$0.50/user/month vs $2–3 without caching.
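With Anthropic's Messages API, caching the system prompt and tools array is done by attaching `cache_control: { type: "ephemeral" }` to the relevant blocks. The sketch below only builds the request body; `buildCachedRequest` is a hypothetical helper, the model ID is passed in rather than assumed, and `max_tokens` is a placeholder.

```typescript
type CacheControl = { type: "ephemeral" };
type ToolDef = {
  name: string;
  description: string;
  input_schema: object;
  cache_control?: CacheControl;
};

function buildCachedRequest(
  model: string,
  systemPrompt: string,
  tools: ToolDef[],
  userMessage: string,
) {
  // Marking the last tool sets a cache breakpoint that covers the whole tools array.
  const cachedTools: ToolDef[] = tools.map((tool, i) =>
    i === tools.length - 1 ? { ...tool, cache_control: { type: "ephemeral" as const } } : tool,
  );
  return {
    model,
    max_tokens: 1024, // placeholder value
    // A second breakpoint on the system prompt caches tools + system together.
    system: [
      { type: "text" as const, text: systemPrompt, cache_control: { type: "ephemeral" as const } },
    ],
    tools: cachedTools,
    messages: [{ role: "user" as const, content: userMessage }],
  };
}
```

Only the user message changes between turns, so within the cache lifetime the large static prefix is billed at the cached rate.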

Model Selection

Claude Sonnet for conversation. Haiku for memory extraction (70% cheaper). The parallel Azure timeout cuts expensive CU calls by 50–70%. A deterministic income builder replaced Claude calls that cost $0.005 per document.

Testing

575 test cases across Vitest unit tests, Playwright E2E scenarios, and Lighthouse CI. Seeded test clients (w2-simple, freelancer, missing-docs) for repeatable E2E runs. GitHub Actions CI on every push.

The Multi-AI Orchestra

"The secret sauce isn't one AI model — it's the orchestration. Six specialized engines, each playing its role in a carefully choreographed system."

Azure DI

Structured extraction — prebuilt IRS form models, field-level confidence, deterministic output

Azure CU

Semantic understanding — handles unusual layouts, Gen AI-powered, custom trainable analyzers

Claude

Conversation + fallback — SSE streaming agent with 16 tools, edge case extraction

Gemini Flash

Reasoning layer — 12 specialized tax reasoning tasks, scenario analysis, planning

ElevenLabs

Voice synthesis — filing summaries as audio, sharable MP3s with weather-effect player

Langfuse

Prompt ops — versioning, A/B testing, conversation tracing, emotional scoring, cost tracking

Each service is chosen for what it does best. Azure for structured document understanding. Claude for nuanced conversation. Gemini for fast reasoning. ElevenLabs for natural voice. Langfuse for iteration speed. The orchestration layer is the moat — not any single model.

What's Next

The platform thesis expands

Everything built so far — the extraction engine, the agent architecture, the psychographic layer — is a horizontal capability. Tax was the proving ground. The intelligence layer is the product.

Beyond Tax

The workflow intelligence layer that monitors every client relationship in real-time. Flags unreplied communications, missing documents, and approaching deadlines before they become problems.

Beyond One Firm

Multi-tenant architecture. One codebase, vertical-specific configuration. Status pipelines, deadline calendars, signal types, and alert templates — all configurable per vertical.

Beyond US

300,000+ chartered accountants in India. Massively underserved. Distribution moat + product validation + seasonality buffer. Peak seasons offset: India (Jan–Mar, Jul–Sep) vs US (Jan–Apr).

Beyond Filing

Year-round engagement: tax planning, RSU vesting calendar, quarterly estimated payments, Roth conversion optimizer. The client relationship doesn't end at e-file — it begins.

"We built this in 25 days with a lean team. Imagine what happens with 12 months and a platform thesis."
