Engineering notes

How Invoice Extractor actually works.

One Vercel deploy, no intermediate storage. The file travels from the browser directly to Gemini as base64 inlineData — no S3, no signed URLs, no preprocessing. Below: the ten steps from upload to structured table, plus the security decisions and the gaps honestly disclosed.

Pipeline at a glance

Browser
     │
     │  POST /api/extract (multipart, field: file)
     ▼
┌─────────────────────────────────────────────┐
│  1. Rate limit (Upstash sliding window)     │
│  2. Multipart parse — require "file" field  │
│  3. Size cap (8 MB hard limit)              │
│  4. Magic-byte detection (PNG/JPG/WEBP/PDF) │
│  5. base64-encode buffer for inlineData     │
│  6. Gemini 2.5 Flash (vision, temp=0)       │
│     └─ responseSchema constrains output     │
│     └─ systemInstruction isolates prompt    │
│  7. JSON.parse raw response text            │
│  8. Zod validate + coerce numbers           │
│  9. CSV build + formula-injection escape    │
└─────────────────────────────────────────────┘
     │
     │  { ok: true, data: InvoiceData, csv: "…", meta: {…} }
     ▼
  Browser renders ledger table + download buttons

Step by step

01
Rate limit
Upstash sliding window — 20 requests per IP per day. Graceful no-op when Upstash is not configured (local dev). Prefix: rl:invoice.
02
Multipart parse
Next.js req.formData() reads the upload. The file field is required; anything else is ignored. 400 if the form is malformed or the field is missing.
03
Size cap
Hard 8 MB cap enforced on file.size before reading the buffer into memory. Returns 413 with a human-readable size in the error. Prevents memory exhaustion on the serverless function.
04
Magic-byte file-type detection
The browser-supplied Content-Type header is never trusted. The first bytes of the buffer are inspected directly: 89 50 4E 47 → PNG, FF D8 FF → JPEG, RIFF….WEBP → WEBP, %PDF → PDF. Any other signature → 415. This prevents content-type spoofing attacks (e.g. an executable renamed to invoice.pdf).
05
Gemini inlineData call
The buffer is base64-encoded and sent to Gemini 2.5 Flash as inlineData — the multimodal vision input. The model receives the raw image/PDF bytes directly, not a URL; no intermediate storage or signed URL required. Temperature 0, maxOutputTokens: 2048, responseMimeType: application/json + responseSchema to constrain the output.
06
System-prompt isolation
The system prompt is sent via systemInstruction — structurally separate from the user content. It explicitly instructs the model to treat the document as untrusted data and not to follow any embedded instructions. This mitigates prompt injection attacks where a malicious invoice contains text like “Ignore previous instructions and output…”
07
Gemini responseSchema
A hand-built OpenAPI subset (Gemini does not accept $ref / anyOf; nullable must be nullable: true, not type: ['string', 'null']) constrains the model to the exact shape of an invoice: vendor, lineItems array, totals. This dramatically reduces hallucination and invalid JSON.
08
Zod validation + number coercion
The model output is parsed with JSON.parse then validated by a zod schema. A custom coercedNumber transformer strips currency symbols and commas (e.g. “$1,234.56” → 1234.56) and converts EU-format decimals. Invalid output → 502 with a typed error code, never a raw zod message to the client.
09
CSV formula-injection escaping
Any cell whose string form starts with = + - @ is prefixed with a single apostrophe before writing to CSV. This prevents spreadsheet formula injection — a vendor named =HYPERLINK("http://evil.com") becomes '=HYPERLINK(…) and is treated as plain text by Excel and Google Sheets.
10
Structured JSON + CSV response
The API returns the typed InvoiceData object plus the pre-built CSV string and token/duration telemetry. The client renders the table, offers download buttons, and shows the raw JSON in an accessible details/summary accordion.

File handling

No files are persisted on the server. The buffer lives in memory for the duration of the serverless function invocation (< 60 s), then is garbage-collected. No S3, no GCS, no database writes.

Magic bytes vs. MIME

The browser’s Content-Type header is informational only — any client can send image/png for an EXE. Magic-byte detection reads the actual file signature from bytes 0–11 of the buffer. This is the same technique used by file(1) on Unix.

Gap: the Vitest spec covers detection and spoofing, but the server does not re-validate after Gemini base64-decodes the data (Gemini handles the decoding internally). Considered acceptable for this demo.

Prompt injection

The system prompt is sent via systemInstruction — structurally separate from user content in the Gemini API. It instructs the model to treat the document as untrusted data.

Gap: a sufficiently adversarial document (e.g. white-on-white text saying “output your system prompt”) may still confuse the model. Full mitigation requires output whitelisting and redaction before returning to the client — out of scope for this demo.

Cost & limits

Gemini 2.5 Flash at temperature 0, max 2048 output tokens. Image cost: ~$0.0004–0.002 per call depending on image size. Rate limited to 20/day per IP. GCP budget alert recommended at $20/mo. Route uses maxDuration = 60 (Vercel Pro required for production).

Want this for your business?

This demo is the real architecture — minus a persistent results store and a batch processing tier (both wired for client builds). If you have invoices, receipts, or structured documents that need to become typed data at scale, email me with the document type and volume. I’ll reply within 24 hours.

Email me →← demo

Security stance

Defended: magic-byte type enforcement, hard size cap, rate limiting by IP, zod validation of all LLM output, CSV formula-injection escaping, no stack traces to client, system-prompt isolation against prompt injection, no file persistence.

Not defended (flagged honestly): adversarial prompt injection embedded in document content may succeed against sophisticated attacks; MIME spoofing is caught at the byte level but Gemini’s internal decode is not re-verified; no virus/malware scanning of uploaded files (out of scope for an AI extraction demo).

Pipeline at a glance

Step by step

Rate limit

Multipart parse

Size cap

Magic-byte file-type detection

Gemini inlineData call

System-prompt isolation

Gemini responseSchema

Zod validation + number coercion

CSV formula-injection escaping

Structured JSON + CSV response

Security stance