EtherFun Lab

pgvector + Voyage for AI companion memory — 5× less RAM than Mem0

A walkthrough of how we run AI companion memory inside Postgres with pgvector + Voyage embeddings, and why dropping the Mem0 sidecar shrank our engine's resident memory by roughly 5×. Concrete schema, the embed-write-search loop in Rust, and the deployment math behind self-hosting on a shared-cpu-1x machine.

In the last post we explained why AI companion products don’t need a generic memory layer. This one is the engineering follow-up: how the replacement is actually built.

The headline result: after removing the Mem0 sidecar from eros-engine and folding memory into Postgres + pgvector + Voyage, the steady-state RAM of our memory-handling surface dropped to roughly 1/5 of what it was. That sounds like marketing, but it isn’t a tuned benchmark number; it’s the natural consequence of deleting an entire Python process and the indexes it kept hot. The rest of this post walks through what was deleted, what replaced it, and how to reproduce the setup.

Code is at github.com/etherfunlab/eros-engine, AGPL-3.0-only. The memory layer is in crates/eros-engine-store/src/memory.rs and the post-process fan-out that calls it is in crates/eros-engine-server/src/pipeline/post_process.rs.

Where Mem0’s RAM was going

A Mem0 deployment, even at small scale, has more moving parts than the SDK suggests:

  1. A Python process for the SDK / server, with its interpreter, glibc allocations, and the usual long tail of imported modules.
  2. In-memory indexes for fast nearest-neighbor search. Vector indexes are RAM-hot by design; if you want low-latency search, the index lives in memory.
  3. Caches for dedup, recent-write buffering, and embedding result caches.
  4. If you self-host the embedding model (instead of OpenAI), the model weights themselves.

By comparison, none of those four exists on our pgvector path. The Rust process sends short HTTP calls to Voyage, ships an INSERT and a SELECT to Postgres, and that’s it. Postgres already runs (we use it for auth.users, chat sessions, affinity state, persona genomes, and so on), so the marginal RAM cost of adding engine.companion_memories is whatever the table and its index cost, and Postgres pages those in and out of its buffer cache as needed.

That’s the engineering shape behind “5× less RAM”: we didn’t optimize anything; we deleted a process.

Why Voyage specifically

Voyage isn’t the only embedding API. We picked it because it lined up with constraints the engine already had:

We use 512-dim output (voyage-3-lite’s default). 512 is enough for relationship-memory recall quality at our scale; 1024+ would just inflate the index without measurably improving retrieval.
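To put rough numbers on the dimension trade-off, here is a minimal sketch of the storage arithmetic. It relies on pgvector's documented storage cost of 4 bytes per dimension plus 8 bytes per vector. The function names and the 100,000-row figure in the note below are illustrative assumptions, not values from the engine.

```rust
/// Bytes pgvector uses to store one `vector(dims)` value:
/// 4 bytes per f32 component plus an 8-byte header.
pub fn vector_bytes(dims: usize) -> usize {
    4 * dims + 8
}

/// Raw embedding payload for `rows` memories. This ignores tuple headers,
/// the other columns, and any index overhead, so treat it as a lower bound.
pub fn embedding_payload_bytes(rows: usize, dims: usize) -> usize {
    rows * vector_bytes(dims)
}
```

For a hypothetical 100,000 memories, `embedding_payload_bytes(100_000, 512)` is about 206 MB of raw vectors, while the same call with 1024 dimensions is about 410 MB. Doubling the dimension roughly doubles what the table and its vector index have to hold.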

The schema

CREATE TABLE engine.companion_memories (
  id          uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id     uuid NOT NULL,
  instance_id uuid,                 -- NULL = profile memory; non-NULL = relationship memory
  content     text NOT NULL,
  embedding   vector(512) NOT NULL, -- voyage-3-lite output
  category    text,                 -- 'fact' | 'preference' | 'event' | ...
  created_at  timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX companion_memories_user_idx
  ON engine.companion_memories (user_id);

CREATE INDEX companion_memories_user_persona_idx
  ON engine.companion_memories (user_id, instance_id);

CREATE INDEX companion_memories_embedding_idx
  ON engine.companion_memories
  USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

Note: the DDL above is simplified for readability. The production migration (0003_memory.sql) also includes a session_id FK to engine.chat_sessions(id) ON DELETE CASCADE, and uses partial indexes (one per layer, predicated on instance_id IS NULL / IS NOT NULL) rather than full-table ones, so each non-vector index covers only one layer.
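The `lists = 100` setting on the ivfflat index is a tuning knob, so a rough way to sanity-check it against table size may help. The sketch below encodes the starting-point heuristics from the pgvector README (rows / 1000 up to one million rows, sqrt(rows) above that, and sqrt(lists) for the query-time `ivfflat.probes` setting). The function names are hypothetical, and nothing here is taken from the production migration.

```rust
/// Starting point for ivfflat `lists`, per the pgvector README's rule of thumb:
/// rows / 1000 for up to 1M rows, sqrt(rows) above that. Never returns 0.
pub fn suggested_lists(rows: u64) -> u64 {
    let lists = if rows <= 1_000_000 {
        rows / 1000
    } else {
        (rows as f64).sqrt().round() as u64
    };
    lists.max(1)
}

/// Starting point for `ivfflat.probes` at query time: sqrt(lists).
/// More probes improve recall at the cost of speed.
pub fn suggested_probes(lists: u64) -> u64 {
    ((lists as f64).sqrt().round() as u64).max(1)
}
```

Under this rule of thumb, `lists = 100` corresponds to a table of roughly 100,000 rows, and `suggested_probes(100)` gives 10. These are starting points to measure against, not guarantees of recall.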

A few choices worth flagging:

Writing memory — embed in the post-process stage

Writes happen after the chat reply has been returned. The chat handler hands the user message + the persona’s reply to a Tokio task and immediately responds to the client. Memory writes that fail don’t stall the user-visible chat.

// crates/eros-engine-store/src/memory.rs (paraphrased)
pub async fn write_memory(
    pool: &PgPool,
    voyage: &VoyageClient,
    user_id: Uuid,
    instance_id: Option<Uuid>,
    content: &str,
    category: &str,
) -> Result<(), MemoryError> {
    let embedding = voyage.embed(content).await?; // [f32; 512]

    sqlx::query!(
        r#"
        INSERT INTO engine.companion_memories
            (user_id, instance_id, content, embedding, category)
        VALUES ($1, $2, $3, $4, $5)
        "#,
        user_id,
        instance_id,
        content,
        // sqlx pgvector adapter maps &[f32] → vector
        &embedding[..] as &[f32],
        category,
    )
    .execute(pool)
    .await?;

    Ok(())
}

Real production code adds: retries against transient Voyage errors, a per-user write rate limit, and a length guard so a runaway extraction prompt can’t write a 50KB row.
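As an illustration of two of those guards, here is a minimal std-only sketch of a length check and a capped exponential backoff schedule for retries. The names (`guard_content`, `GuardError`, `backoff_delay`) and every threshold are assumptions made for this example; the repository's actual limits and retry policy may differ.

```rust
use std::time::Duration;

/// Why a candidate memory was rejected before embedding (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum GuardError {
    Empty,
    TooLong { bytes: usize, max: usize },
}

/// Trim the extracted text and reject it if it is empty or larger than
/// `max_bytes`, so a runaway extraction prompt cannot write a huge row.
pub fn guard_content(content: &str, max_bytes: usize) -> Result<&str, GuardError> {
    let trimmed = content.trim();
    if trimmed.is_empty() {
        return Err(GuardError::Empty);
    }
    if trimmed.len() > max_bytes {
        return Err(GuardError::TooLong { bytes: trimmed.len(), max: max_bytes });
    }
    Ok(trimmed)
}

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
/// Saturating arithmetic keeps large attempt counts from overflowing.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(cap)
}
```

A caller would run `guard_content` before paying for an embedding call, and sleep for `backoff_delay(attempt, base, cap)` between retries of a failed Voyage request. Adding jitter to the delay is a common refinement that this sketch leaves out.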

Reading memory — two queries in parallel

The chat-prompt assembler runs both layer queries concurrently before the next LLM call. The user message is embedded once, then tokio::join! drives both queries concurrently on the same task and returns once both have finished.

let q_emb = voyage.embed(user_message).await?;

let (profile_hits, relationship_hits) = tokio::join!(
    sqlx::query_as!(
        MemoryRow,
        r#"
        SELECT id, content, category, created_at
        FROM engine.companion_memories
        WHERE user_id = $1 AND instance_id IS NULL
        ORDER BY embedding <=> $2
        LIMIT $3
        "#,
        user_id,
        &q_emb[..] as &[f32],
        profile_k,
    ).fetch_all(pool),
    sqlx::query_as!(
        MemoryRow,
        r#"
        SELECT id, content, category, created_at
        FROM engine.companion_memories
        WHERE user_id = $1 AND instance_id = $2
        ORDER BY embedding <=> $3
        LIMIT $4
        "#,
        user_id,
        persona_instance_id,
        &q_emb[..] as &[f32],
        relationship_k,
    ).fetch_all(pool),
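);
// join! yields one Result per query; surface errors before building the prompt.
let (profile_hits, relationship_hits) = (profile_hits?, relationship_hits?);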

<=> is pgvector’s cosine-distance operator. profile_k is small (often 3); relationship_k is usually larger (often 6-8) because relationship memories are the ones that produce the “she remembers me” effect. The two result sets land in separate sections of the system prompt — they are not merged into one bag of facts.
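To make the ordering concrete, here is a small brute-force sketch of what `ORDER BY embedding <=> $q LIMIT k` computes: cosine distance is 1 minus cosine similarity, and the k rows with the smallest distance win. This is an in-memory illustration only; Postgres answers the same question through the ivfflat index, approximately. The function names and the choice to return `None` for mismatched or zero-length vectors are assumptions of this example.

```rust
/// Cosine distance, 1 - cos(a, b): 0 for the same direction, 1 for
/// orthogonal vectors, 2 for opposite ones. Returns None when the inputs
/// differ in length, are empty, or have zero norm.
pub fn cosine_distance(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f32, 0.0f32, 0.0f32);
    for (x, y) in a.iter().zip(b) {
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(1.0 - dot / (norm_a.sqrt() * norm_b.sqrt()))
}

/// Brute-force stand-in for `ORDER BY embedding <=> $q LIMIT k`:
/// score every row, sort ascending by distance, keep the first k.
pub fn top_k<'a>(query: &[f32], rows: &'a [(&'a str, Vec<f32>)], k: usize) -> Vec<&'a str> {
    let mut scored: Vec<(f32, &str)> = rows
        .iter()
        .filter_map(|(content, emb)| cosine_distance(query, emb).map(|d| (d, *content)))
        .collect();
    scored.sort_by(|a, b| a.0.total_cmp(&b.0));
    scored.into_iter().take(k).map(|(_, content)| content).collect()
}
```

Because cosine distance ignores vector length, only the direction of the embedding matters for ranking. The `k` argument plays the role of profile_k or relationship_k in the queries above.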

The math behind self-hosting

Eros runs on Fly.io shared-cpu-1x (256 MB RAM cap, scale-to-zero). With the Mem0 sidecar gone, the engine fits comfortably.

For a self-hoster on similar hardware, the practical implication is: you don’t need a dedicated memory service. A small Rust binary plus the Postgres you already have is the deployment. That removes one container, one set of credentials, one place where bills can accumulate, and one source of cross-process latency on the chat hot path.

Cost per user

Concrete numbers for a single chatty user, ~50 turns/day, with both layers being written + read:

The chat LLM call dominates the bill. Memory is a rounding error.
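For readers who want to plug in their own numbers, the embedding side of that bill can be sketched as simple arithmetic: tokens embedded per day times a per-token price. Everything below is an assumption for illustration, including the function name, the number of embedding calls per turn, the tokens per call, and the price. Check the provider's current price list rather than relying on the placeholder.

```rust
/// Estimated embedding spend in USD for one user over `days` days.
/// All inputs are caller-supplied assumptions; nothing here is measured.
pub fn embedding_cost_usd(
    turns_per_day: u32,
    embeds_per_turn: u32,   // e.g. one query embed for reads + one for a write
    tokens_per_embed: u32,  // average tokens in each embedded text
    days: u32,
    usd_per_million_tokens: f64, // placeholder price, not a quoted rate
) -> f64 {
    let tokens = turns_per_day as f64
        * embeds_per_turn as f64
        * tokens_per_embed as f64
        * days as f64;
    tokens / 1_000_000.0 * usd_per_million_tokens
}
```

With the post's 50 turns/day and hypothetical values of 2 embeds per turn, 40 tokens per embed, 30 days, and a placeholder price of $0.02 per million tokens, the result is about $0.0024 per user per month, which is consistent with memory being a rounding error next to the chat LLM call.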

What you give up vs Mem0

Honestly:

If your team isn’t ready to own those three pieces, Mem0 is the right call. If you’re already running Postgres and you’re already in Rust (or any sqlx-friendly language), this stack is leaner.

Try it

If you want the actual code, it is at github.com/etherfunlab/eros-engine.

Self-hosting needs Postgres 15+ with the vector extension, an OpenRouter key for chat, and a Voyage key for embeddings. The README has the full bring-up. Issues and PRs are welcome, particularly around extraction-prompt quality and ivfflat → hnsw migration runbooks.


Prompted by Henry Lin, written by Opus 4.7. Concrete numbers verified against eros-engine HEAD as of 2026-05-10.