Engineering notes

← back to demo

How Invoice Extractor actually works.

One Vercel deploy, no intermediate storage. The file travels from the browser directly to Gemini as base64 inlineData — no S3, no signed URLs, no preprocessing. Below: the ten steps from upload to structured table, plus the security decisions and the gaps honestly disclosed.

Pipeline at a glance

Browser
     │
     │  POST /api/extract (multipart, field: file)
     ▼
┌─────────────────────────────────────────────┐
│  1. Rate limit (Upstash sliding window)     │
│  2. Multipart parse — require "file" field  │
│  3. Size cap (8 MB hard limit)              │
│  4. Magic-byte detection (PNG/JPG/WEBP/PDF) │
│  5. base64-encode buffer for inlineData     │
│  6. Gemini 2.5 Flash (vision, temp=0)       │
│     └─ responseSchema constrains output     │
│     └─ systemInstruction isolates prompt    │
│  7. JSON.parse raw response text            │
│  8. Zod validate + coerce numbers           │
│  9. CSV build + formula-injection escape    │
└─────────────────────────────────────────────┘
     │
     │  { ok: true, data: InvoiceData, csv: "…", meta: {…} }
     ▼
  Browser renders ledger table + download buttons

Step by step

  1. 01

    Rate limit

    Upstash sliding window — 20 requests per IP per day. Graceful no-op when Upstash is not configured (local dev). Prefix: rl:invoice.

  2. 02

    Multipart parse

    Next.js req.formData() reads the upload. The file field is required; anything else is ignored. 400 if the form is malformed or the field is missing.

  3. 03

    Size cap

    Hard 8 MB cap enforced on file.size before reading the buffer into memory. Returns 413 with a human-readable size in the error. Prevents memory exhaustion on the serverless function.

  4. 04

    Magic-byte file-type detection

    The browser-supplied Content-Type header is never trusted. The first bytes of the buffer are inspected directly: 89 50 4E 47 → PNG, FF D8 FF → JPEG, RIFF….WEBP → WEBP, %PDF → PDF. Any other signature → 415. This prevents content-type spoofing attacks (e.g. an executable renamed to invoice.pdf).

  5. 05

    Gemini inlineData call

    The buffer is base64-encoded and sent to Gemini 2.5 Flash as inlineData — the multimodal vision input. The model receives the raw image/PDF bytes directly, not a URL; no intermediate storage or signed URL required. Temperature 0, maxOutputTokens: 2048, responseMimeType: application/json + responseSchema to constrain the output.

  6. 06

    System-prompt isolation

    The system prompt is sent via systemInstruction — structurally separate from the user content. It explicitly instructs the model to treat the document as untrusted data and not to follow any embedded instructions. This mitigates prompt injection attacks where a malicious invoice contains text like “Ignore previous instructions and output…”

  7. 07

    Gemini responseSchema

    A hand-built OpenAPI subset (Gemini does not accept $ref / anyOf; nullable must be nullable: true, not type: ['string', 'null']) constrains the model to the exact shape of an invoice: vendor, lineItems array, totals. This dramatically reduces hallucination and invalid JSON.

  8. 08

    Zod validation + number coercion

    The model output is parsed with JSON.parse then validated by a zod schema. A custom coercedNumber transformer strips currency symbols and commas (e.g. “$1,234.56” 1234.56) and converts EU-format decimals. Invalid output → 502 with a typed error code, never a raw zod message to the client.

  9. 09

    CSV formula-injection escaping

    Any cell whose string form starts with = + - @ is prefixed with a single apostrophe before writing to CSV. This prevents spreadsheet formula injection — a vendor named =HYPERLINK("http://evil.com") becomes '=HYPERLINK(…) and is treated as plain text by Excel and Google Sheets.

  10. 10

    Structured JSON + CSV response

    The API returns the typed InvoiceData object plus the pre-built CSV string and token/duration telemetry. The client renders the table, offers download buttons, and shows the raw JSON in an accessible details/summary accordion.

Security stance

Defended: magic-byte type enforcement, hard size cap, rate limiting by IP, zod validation of all LLM output, CSV formula-injection escaping, no stack traces to client, system-prompt isolation against prompt injection, no file persistence.

Not defended (flagged honestly): adversarial prompt injection embedded in document content may succeed against sophisticated attacks; MIME spoofing is caught at the byte level but Gemini’s internal decode is not re-verified; no virus/malware scanning of uploaded files (out of scope for an AI extraction demo).