
blog · 2026-05-03

How interview copilots actually work — a technical anatomy

A walkthrough of the audio-capture, STT, LLM, and stealth pipeline that powers a real-time interview copilot. With code references and latency budgets.

Why most candidates can't tell what's happening

When a candidate downloads an interview copilot, they see one window: an overlay with a question on top and a streaming answer underneath. What they don't see is a four-stage pipeline that fires hundreds of times during a 30-minute call, each stage with its own millisecond budget and its own failure mode.

This post walks the entire pipeline, top to bottom, the way a senior engineer would think about it. We'll cover audio capture, streaming STT, LLM execution with prompt caching, and the OS-level stealth APIs. By the end, you should be able to tell the difference between a competent implementation and a barely-shipped one — and that difference is exactly what shows up in your interviewer's perception.

Stage 1: audio capture

The first decision is what audio to capture. Two channels matter: the mic carries the candidate, and system audio (loopback) carries the interviewer.

On macOS, system-audio loopback used to require a kernel extension or a virtual audio device like BlackHole. In Electron 28+ (we're on 33), getDisplayMedia({ audio: true, video: true }) paired with setDisplayMediaRequestHandler returning { audio: 'loopback' } taps into ScreenCaptureKit and gives us system audio cleanly. No kext, no install pain, no permission theater beyond the one-time Screen Recording grant.

On Windows, WASAPI loopback is the equivalent, exposed by Chromium's navigator.mediaDevices.getDisplayMedia once the Electron main process has approved the request.
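On both platforms the main-process wiring is the same: approve the renderer's getDisplayMedia request and ask Chromium to attach loopback audio. A minimal sketch, using the documented setDisplayMediaRequestHandler and desktopCapturer APIs; the window setup here is illustrative, not our actual configuration:

import { app, BrowserWindow, desktopCapturer, session } from 'electron'

app.whenReady().then(() => {
  const win = new BrowserWindow({ width: 480, height: 640 })

  // Approve getDisplayMedia() requests from the renderer and attach
  // system-audio loopback to the returned stream.
  session.defaultSession.setDisplayMediaRequestHandler((request, callback) => {
    desktopCapturer.getSources({ types: ['screen'] }).then((sources) => {
      callback({ video: sources[0], audio: 'loopback' })
    })
  })

  win.loadFile('index.html')
})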

Two streams, two AudioContexts, both running through a ScriptProcessorNode (we'll migrate to AudioWorklet when latency telemetry says it's worth it). Each frame is converted Float32 → Int16 PCM and IPC-forwarded to the main process. From there we have full control of the audio bytes.
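In the renderer, the loopback path looks roughly like the sketch below (the mic path is identical, just starting from getUserMedia). The 16 kHz sample rate, the 4096-sample buffer, and the window.copilot.sendAudio bridge are illustrative placeholders, not the product's actual values:

// Renderer process. window.copilot.sendAudio is an illustrative preload bridge
// around ipcRenderer.send; the channel label and buffer size are placeholders.
async function captureSystemAudio() {
  const stream = await navigator.mediaDevices.getDisplayMedia({ audio: true, video: true })

  const ctx = new AudioContext({ sampleRate: 16000 })
  const source = ctx.createMediaStreamSource(stream)
  const processor = ctx.createScriptProcessor(4096, 1, 1)

  processor.onaudioprocess = (event) => {
    const float32 = event.inputBuffer.getChannelData(0)
    const int16 = new Int16Array(float32.length)
    for (let i = 0; i < float32.length; i++) {
      // Clamp and rescale Float32 [-1, 1] to Int16 PCM
      const s = Math.max(-1, Math.min(1, float32[i]))
      int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff
    }
    window.copilot.sendAudio('system', int16.buffer)
  }

  source.connect(processor)
  processor.connect(ctx.destination) // ScriptProcessorNode only fires when connected to a sink
}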

Stage 2: streaming STT

The naive implementation: wait for the user to finish speaking, then send the full audio to a transcription API, then send the transcript to an LLM. That's how Final Round AI ships, and it's why their first-token latency is ~1.8 seconds.

The right implementation: streaming STT with interim partials, plus speech-final detection.

We use Deepgram Nova-3 streaming over WebSocket (cloud) and whisper.cpp with Core ML acceleration (on-device, Apple Silicon). Both emit interim partials within 60–80 milliseconds. Deepgram's is_final flag fires every few words; its speech_final flag fires only when endpointing detects an actual pause (300 ms of silence in our config).

This distinction is critical. If you fire the LLM on is_final, you'll send fragments — "where would you start?" — without the rest of the question. The LLM will answer the fragment. We learned this the hard way; the fix was to buffer is_final chunks per channel and fire only on speech_final with the accumulated text.
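Here's a sketch of that buffering against Deepgram's documented live-transcription message shape (is_final, speech_final, channel.alternatives). The onQuestion callback is a placeholder for whatever kicks off Stage 3:

import { createClient, LiveTranscriptionEvents } from '@deepgram/sdk'

const deepgram = createClient(process.env.DEEPGRAM_API_KEY)

const connection = deepgram.listen.live({
  model: 'nova-3',
  encoding: 'linear16', // raw Int16 PCM from Stage 1, pushed in via connection.send(frame)
  sample_rate: 16000,
  interim_results: true,
  endpointing: 300 // ms of silence before speech_final fires
})

let buffer = [] // one buffer per channel in practice; this shows the interviewer stream

connection.on(LiveTranscriptionEvents.Transcript, (msg) => {
  const text = msg.channel.alternatives[0]?.transcript ?? ''
  if (!text) return

  if (msg.is_final) buffer.push(text) // fragment is finalized, but the speaker may keep going

  if (msg.speech_final) {
    onQuestion(buffer.join(' ')) // endpointing saw a real pause: this is the whole question
    buffer = []
  }
})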

Stage 3: LLM with prompt caching

Now the LLM call. Three implementation details separate the slow products from the fast ones.

Prompt caching. Anthropic's API supports up to four cache breakpoints per request. We use two: one for the system prompt (rules), one for the candidate's profile (résumé, STAR stories, job description). After the first call in a session, follow-up questions reuse both cached prefixes, which saves roughly 95% of input cost and 200–300 ms of time-to-first-token (TTFT).

import Anthropic from '@anthropic-ai/sdk'

const client = new Anthropic() // reads ANTHROPIC_API_KEY from the environment

const stream = client.messages.stream({
  model: 'claude-sonnet-4-6',
  max_tokens: 400,
  system: [
    // Cache breakpoint 1: static rules, identical for every call in the session
    { type: 'text', text: SYSTEM_PROMPT, cache_control: { type: 'ephemeral' } },
    // Cache breakpoint 2: candidate profile (résumé + STAR + JD)
    { type: 'text', text: profileBlock, cache_control: { type: 'ephemeral' } }
  ],
  messages: [{ role: 'user', content: question }]
})

Cancellation. When a new question arrives, we abort the in-flight call. Stale answers must not leak onto the screen.
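A minimal sketch of that cancellation, assuming the SDK's standard per-request signal option; render is a placeholder for the overlay update, and system stands for the cached system array from the snippet above:

let controller = null

async function answer(question) {
  controller?.abort() // kill any in-flight generation before starting a new one
  const mine = new AbortController()
  controller = mine

  const stream = client.messages.stream(
    {
      model: 'claude-sonnet-4-6',
      max_tokens: 400,
      system,
      messages: [{ role: 'user', content: question }]
    },
    { signal: mine.signal }
  )

  stream.on('text', (delta) => render(delta)) // append each delta to the overlay

  try {
    await stream.finalMessage()
  } catch (err) {
    if (mine.signal.aborted) return // superseded by a newer question; drop it silently
    throw err
  }
}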

Two-tier inference (planned). The fastest known move is to fire a Groq Llama 3.3 draft on the first high-confidence STT partial — speculative execution before the question finishes. Then Claude Sonnet 4.6 streams the refined answer underneath, with a CSS opacity transition. The user sees a useful skeleton in <100ms and a polished version by ~400ms.
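Sketched as code, with the caveat that this tier is still planned: the draft call below assumes Groq's OpenAI-compatible chat API, and drawDraft is a placeholder render hook:

import Groq from 'groq-sdk'

const groq = new Groq() // reads GROQ_API_KEY from the environment

// Tier 1: fire a cheap draft the moment a high-confidence partial arrives,
// before speech_final has even delivered the complete question.
async function speculate(partialQuestion) {
  const draft = await groq.chat.completions.create({
    model: 'llama-3.3-70b-versatile',
    max_tokens: 150,
    messages: [{ role: 'user', content: partialQuestion }]
  })
  drawDraft(draft.choices[0].message.content) // low-opacity skeleton in the overlay
}

// Tier 2 is the existing Claude stream from Stage 3, rendered over the
// skeleton with an opacity transition once the full question arrives.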

Stage 4: stealth via OS API

This is the part that earns the most "is this even legal" questions, and the answer is: yes, fully, because we're using the same documented OS APIs that 1Password uses to hide passwords from screen recording.

macOS: NSWindow.sharingType = .none. Documented since 10.13. Excludes the window from CGWindowListCopyWindowInfo and from AVCaptureScreenInput. Zoom, Teams, Meet, and Webex all use one of those APIs to capture the screen, so our window is invisible to all of them.

Windows: SetWindowDisplayAffinity(hWnd, WDA_EXCLUDEFROMCAPTURE). Documented since Windows 10 build 2004 (May 2020). Excludes the window from BitBlt and DXGI Desktop Duplication.

In Electron, both are exposed via BrowserWindow.setContentProtection(true). One line.
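Concretely, when constructing the overlay window (the window options here are illustrative):

const overlay = new BrowserWindow({
  frame: false,
  transparent: true,
  alwaysOnTop: true
})

// One call: maps to NSWindow.sharingType = .none on macOS and
// SetWindowDisplayAffinity(hWnd, WDA_EXCLUDEFROMCAPTURE) on Windows.
overlay.setContentProtection(true)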

These are not exploits. They are documented APIs that Apple and Microsoft ship intentionally for legitimate use cases — DRM playback, password managers, financial apps. They will keep working because turning them off would break a longer list of approved use cases than it would block.

The ethics question

If you've made it this far you're probably wondering whether this is "cheating." Honest answer: it depends on what you tell the interviewer. Live AI assistance in interviews where the company hasn't explicitly forbidden it is a gray area as of 2026. Many companies (including FAANG) have not updated their interview policies to explicitly prohibit it. Many candidates use it; many recruiters know they do.

The product itself is morally neutral, like a calculator or a search engine. The use case is on the user. We support sales calls, client meetings, internal reviews, language assistance, and interviews where it's permitted. We don't market it as a "cheat." We don't think you should call it that either.

Where this is going

Three things are coming. First, on-device whisper.cpp partials at 60ms, which we're currently shipping for Apple Silicon and porting to ARM Windows. Second, the two-tier Groq+Claude inference for sub-100ms first-token. Third, a public latency leaderboard with opt-in real-user telemetry — published numbers vs every competitor, updated daily.

If you've built or want to build something in this space, the bar is now public. The implementation references are all here. The hard part isn't any single layer — it's gluing the four together such that they don't fight each other.