GitHub Year in Review: 2024 — What Open Source Changed in the Engineering Stack
Content reflects the state as of January 2025. AI tooling and model capabilities in this area change frequently.
At the start of 2024, AI assistants answered questions. They did not act. Engineers building AI-augmented systems still scraped their own web data with Selenium, wrote custom database connectors for each LLM integration, and maintained separate embedding pipelines decoupled from their primary datastores. By October, browser-use had shipped a library that handed any LLM a real Chromium browser to operate. OpenHands had reached 74,000 GitHub stars after researchers demonstrated it could autonomously fix GitHub issues end-to-end. Google had open-sourced an MCP server that connected Claude, Gemini, and other MCP-compatible clients to BigQuery, Spanner, and PostgreSQL without a line of custom connector code. Three convergent waves defined the year: the operator layer arrived, the knowledge retrieval layer got a graph spine, and the database-to-AI interface standardized around a protocol. Nine repositories show exactly where each shift happened.
The Year at a Glance
| Theme | Repository | Domain | Eliminated Manual Task | Peak Stars |
|---|---|---|---|---|
| Agents as Operators | firecrawl/firecrawl | System Design | Custom per-site scraping pipelines for AI input | 123,403 |
| Agents as Operators | browser-use/browser-use | System Design | Per-site Playwright automation scripts | 95,226 |
| Agents as Operators | OpenHands/OpenHands | Developer Productivity | Manual write-test-debug cycle for every code change | 74,651 |
| RAG with Graph | microsoft/graphrag | System Design | Flat vector search for multi-hop document questions | 33,182 |
| RAG with Graph | HKUDS/LightRAG | System Design | Maintaining separate vector DB and graph DB pipelines | 35,620 |
| RAG with Graph | getzep/graphiti | System Design | Ad-hoc agent memory using truncated message lists | 26,430 |
| Databases Go AI-Native | googleapis/mcp-toolbox | Databases | Custom connector per AI assistant per database | 15,323 |
| Databases Go AI-Native | Canner/WrenAI | Databases | Brittle NL2SQL prompt engineering without schema semantics | 15,310 |
| Databases Go AI-Native | timescale/pgai | Databases | External embedding pipeline with manual synchronization | 5,802 |
Situation
At the start of 2024, three technical constraints kept AI systems answering questions rather than taking action. First, connecting an LLM to real-world data (a website, a database, a codebase) required writing and maintaining a custom connector for each pairing; no standard interface existed. Second, RAG systems built on vector similarity search had a documented failure mode with multi-hop questions: vector search returns isolated chunks, not relationships between entities across documents. Third, LLM agents had no persistent memory of facts that changed over time: truncating session history meant the agent forgot earlier facts, and flat storage meant it could not resolve contradictions. The year’s open-source releases addressed each constraint, and the star counts suggest the adoption was not merely theoretical.
The Problem at Year Start
| Domain | Manual task | Engineering cost | Status at year end |
|---|---|---|---|
| System design | Writing per-site Playwright scripts for web data extraction | 1–3 days per site; breaks on UI changes | Eliminated for LLM-ready output by firecrawl |
| System design | Building per-LLM per-database connector code | 1–2 weeks per integration; repeated for every new model | Standardized via MCP; mcp-toolbox covers 11+ databases |
| System design — RAG | Multi-hop questions over document corpora | Poor accuracy from vector search; hours of prompt engineering | Graph-augmented retrieval addressable via graphrag and LightRAG |
| Platform engineering | Deploying AI agents to production Kubernetes | 4–8 hours per new agent workload; bespoke manifests per service | Partially reduced; agent frameworks matured across the year |
| Databases | Maintaining external embedding pipeline synchronized with source data | Ongoing ops; stale embeddings accumulate during outages | Automated by pgai vectorizer inside PostgreSQL |
| Databases | NL2SQL without hallucinating column or table names | Per-query schema-dump prompting; business definitions not captured | Semantic layer approach standardized by WrenAI |
The question 2024 answered: can open-source AI tooling at the infrastructure layer remove the connector-writing, pipeline-building, and prompt-engineering overhead that consumes engineering cycles each time a new AI use case begins?
2024: AI Tooling Moved from Answering to Acting
```mermaid
flowchart TD
    A[2024 — AI stopped answering and started acting] --> B[Theme 1 — Agents as Operators]
    A --> C[Theme 2 — RAG with Graph Structure]
    A --> D[Theme 3 — Databases Go AI-Native]
    B --> E[firecrawl — web data for AI]
    B --> F[browser-use — AI controls browser]
    B --> G[OpenHands — AI edits and runs code]
    C --> H[graphrag — entity graph from documents]
    C --> I[LightRAG — hybrid graph and vector retrieval]
    C --> J[graphiti — temporal agent memory]
    D --> K[mcp-toolbox — MCP server for databases]
    D --> L[WrenAI — semantic layer for NL2SQL]
    D --> M[pgai — embeddings inside PostgreSQL]
```
Theme 1: AI Agents Learned to Operate the Computer
Building an AI system that acted on the web in early 2024 meant writing brittle Playwright scripts per site, or accepting that your agent was constrained to text generation. Three repositories removed that constraint by shipping the operator layer as a reusable dependency — the plumbing that connects an LLM to real systems.
firecrawl/firecrawl — replacing per-site scraping pipelines with a single web API
- Before — the manual workflow: JavaScript-heavy pages required Selenium or Playwright; proxy rotation, rate limiting, and content cleaning were per-project work that did not transfer across sites.
```python
# Before: JS-rendered pages require Playwright; output needs manual cleaning
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    html = page.content()
# Manual extraction, markdown conversion, proxy rotation: all bespoke per site
```
- After — with firecrawl:
```python
# After: firecrawl Python SDK, one call returns LLM-ready markdown
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-...")
result = app.scrape_url("https://example.com", formats=["markdown"])
# result.markdown: complete content, JS-rendered, proxy-handled, clean
```
- The productivity delta: According to the project README, firecrawl “handles rotating proxies, orchestration, rate limits, JS-blocked content, and more — zero configuration.” The README reports P95 latency of 3.4 seconds across millions of pages. The engineer no longer maintains a per-site extraction layer or manages proxy infrastructure.
- How it works: Firecrawl wraps a headless browser pool with proxy rotation and content normalization. Output formats include markdown, structured JSON, screenshots, and links — all sized for LLM token budgets. The README states it “covers 96% of the web, including JS-heavy pages.”
- Where it breaks: The hosted service has rate limits proportional to the plan. Self-hosting moves the proxy pool management back to the team — the operational complexity Firecrawl abstracts. For high-volume, budget-constrained scraping, the self-hosted version requires provisioning and operating the proxy infrastructure the README describes as “handled.”
browser-use/browser-use — replacing per-site Playwright scripts with an LLM-controlled browser
- Before — the manual workflow: Web task automation required a script that knew the target site’s DOM — specific selectors, form field names, navigation sequences. Each script was brittle to UI changes and non-transferable to new sites.
```python
# Before: Playwright script tied to one site's DOM structure
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://example.com/form")
        await page.fill('input[name="email"]', "user@example.com")
        await page.click('button[type="submit"]')

asyncio.run(main())
# Breaks if the site redesigns the form; does not generalize
```
- After — with browser-use: the LLM reads the page visually and adapts to layout changes without script updates.
```python
# After: browser-use agent navigates any site from a task description
import asyncio
from browser_use import Agent
from langchain_openai import ChatOpenAI

async def main():
    agent = Agent(
        task="Fill out the contact form with name 'Test User' and email 'test@example.com'",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    return await agent.run()

result = asyncio.run(main())
```
- The productivity delta: The project README states browser-use “makes websites accessible for AI agents” by providing browser control without per-site script maintenance. The README notes the library works with any LLM via LangChain, and a cloud service is available for teams that want hosted browser sessions.
- How it works: The library passes visual DOM state to the LLM, which generates action sequences (click, fill, scroll, navigate) based on the task description. No site-specific selectors are needed.
- Where it breaks: Agents navigating visually are slower and more expensive per task than scripted automation. For deterministic, high-frequency workflows (thousands of daily runs), a maintained Playwright script remains cheaper. Browser-use’s value is highest for irregular tasks or sites that change layout frequently.
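The cost trade-off in the last bullet can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative only: every figure (per-run agent cost, per-run script cost, monthly script upkeep) is a made-up assumption, and the function names are hypothetical rather than part of browser-use or Playwright.

```python
def monthly_cost(runs_per_month: int, cost_per_run: float, fixed_monthly: float = 0.0) -> float:
    """Total monthly cost = fixed upkeep + per-run cost * volume."""
    return fixed_monthly + runs_per_month * cost_per_run


def break_even_runs(agent_cost_per_run: float, script_cost_per_run: float,
                    script_upkeep_monthly: float) -> float:
    """Monthly run count above which the maintained script becomes cheaper."""
    per_run_gap = agent_cost_per_run - script_cost_per_run
    if per_run_gap <= 0:
        raise ValueError("agent must cost more per run for a break-even to exist")
    return script_upkeep_monthly / per_run_gap


# Assumed, illustrative figures in dollars (not measurements):
AGENT_PER_RUN = 0.05    # LLM tokens plus browser time per task
SCRIPT_PER_RUN = 0.001  # compute only
SCRIPT_UPKEEP = 400.0   # engineer time spent repairing selectors each month

threshold = break_even_runs(AGENT_PER_RUN, SCRIPT_PER_RUN, SCRIPT_UPKEEP)
low_volume, high_volume = 500, 60_000

agent_low = monthly_cost(low_volume, AGENT_PER_RUN)
script_low = monthly_cost(low_volume, SCRIPT_PER_RUN, SCRIPT_UPKEEP)
agent_high = monthly_cost(high_volume, AGENT_PER_RUN)
script_high = monthly_cost(high_volume, SCRIPT_PER_RUN, SCRIPT_UPKEEP)
```

Under these assumed numbers the agent is cheaper at a few hundred runs per month and the script is cheaper at tens of thousands; the useful output is the threshold, which a team can recompute with its own measured costs.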
OpenHands/OpenHands — replacing the manual write-test-debug cycle with an autonomous coding agent
- Before — the manual workflow: A developer reads a failing test, edits the function, re-runs the test suite, interprets the output, and repeats — context switching between editor, terminal, and ticket.
```bash
# Before: manual write-test-debug loop
vim src/parser.py
python -m pytest tests/test_parser.py -v
# Read failure output, return to editor, repeat until green
```
- After — with OpenHands CLI:
```bash
# After: OpenHands handles the read-edit-test loop autonomously
openhands run --task "Fix the failing test in tests/test_parser.py; \
the parse_config function is not handling null values in the options dict"
# OpenHands reads files, edits code, runs tests, interprets output, iterates
```
- The productivity delta: The project README reports a 77.6% SWE-Bench score, a benchmark measuring autonomous resolution of real GitHub issues, and links to the benchmark spreadsheet. This is a documented capability signal: on that curated benchmark, the agent resolves roughly three in four issues without a human in the loop.
- How it works: OpenHands provides a sandboxed runtime where an AI agent reads files, edits code, runs test suites, and interprets terminal output. The README describes both a CLI for single tasks and an SDK for running agents at scale.
- Where it breaks: An agent's solution may be functionally correct but deviate from team coding conventions: naming, patterns, error-handling idioms. Human review before merge is still required. The SDK described in the README is designed to be composable, allowing teams to constrain the file scope available to the agent per task.
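One way to picture "constraining the file scope" is a path allowlist applied to whatever files an agent proposes to change. The sketch below is a generic check a team could run in review tooling; it is an assumption for illustration, not the OpenHands SDK's mechanism, and `is_in_scope` and `out_of_scope` are hypothetical names.

```python
from pathlib import PurePosixPath


def is_in_scope(path: str, allowed_dirs: list[str]) -> bool:
    """Return True if `path` sits under one of the allowed directories."""
    candidate = PurePosixPath(path)
    # Reject absolute paths and parent-directory escapes outright.
    if candidate.is_absolute() or ".." in candidate.parts:
        return False
    return any(candidate.is_relative_to(PurePosixPath(d)) for d in allowed_dirs)


def out_of_scope(changed_files: list[str], allowed_dirs: list[str]) -> list[str]:
    """List the changed files that fall outside the allowed directories."""
    return [f for f in changed_files if not is_in_scope(f, allowed_dirs)]


# Hypothetical change set produced by an agent run.
changed = ["src/parser.py", "tests/test_parser.py", "deploy/prod.yaml", "src/../secrets.env"]
violations = out_of_scope(changed, allowed_dirs=["src", "tests"])
```

A non-empty `violations` list is a reason to reject or re-scope the task before a human spends time on code review.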
Theme 2: RAG Grew a Graph Spine
By early 2024, vector similarity search as the sole retrieval mechanism had a documented failure mode: questions requiring multi-hop reasoning (“how does A relate to B through C?”) returned isolated chunks rather than connected answers. Three repositories that shipped in 2024 addressed this by adding a graph layer to the retrieval process, each targeting a different part of the problem: indexing, retrieval, and persistent agent memory.
microsoft/graphrag — entity graph extraction for multi-hop document retrieval
- Before — the manual workflow: Standard RAG embeds document chunks and retrieves the top-k most similar chunks. Multi-hop questions fail because the answer requires traversing entity relationships that do not co-occur in any single chunk.
```text
# Before: flat vector RAG returns isolated chunks, no relational context
# Question: "What themes connect John's research and Mary's implementation work?"
# Vector search returns John's chunks OR Mary's chunks, not their intersection
# The relationship between them lives in neither chunk individually
```
- After — with graphrag:
```bash
# After: graphrag indexes documents into an entity-relationship graph
pip install graphrag
python -m graphrag index --root ./my-documents
# Extracts entities, relationships, and community summaries via LLM calls
python -m graphrag query --root ./my-documents \
  --method global \
  --query "What themes connect all the research papers?"
# Graph traversal finds cross-document connections unavailable to vector search
```
- The productivity delta: According to the README, the linked Microsoft Research blog post, and the accompanying paper (arXiv 2404.16130), GraphRAG “unlocks LLM discovery on narrative private data” by maintaining graph-structured knowledge. That structure supports a global query mode, which summarizes across the entire corpus; flat vector search cannot do this.
- How it works: GraphRAG runs an LLM-powered indexing pipeline that extracts named entities and relationships from each document, then organizes them into community clusters. At query time, graph traversal finds cross-document connections. The README notes two query modes: local (specific entity focus) and global (corpus-wide summarization).
- Where it breaks: The README includes a direct warning: “GraphRAG indexing can be an expensive operation — please read all of the documentation and start small.” The LLM-powered extraction step runs at index time and costs proportionally to corpus size. Not suitable for large-scale indexing without cost controls in place first.
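The multi-hop idea is easier to see on a toy graph. The sketch below is a plain breadth-first search over hand-written edges, each tagged with the (hypothetical) document it came from. It is not GraphRAG's pipeline, which also builds community summaries with LLM calls; it only shows why a relationship that spans three documents can be recovered from a graph but not from any single chunk.

```python
from collections import deque

# Toy entity graph: each edge is assumed to have been extracted from a different document.
EDGES = [
    ("John", "graph indexing", "doc_1"),
    ("graph indexing", "query planner", "doc_2"),
    ("query planner", "Mary", "doc_3"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map: entity -> [(neighbor, source_doc)]."""
    adj = {}
    for a, b, doc in edges:
        adj.setdefault(a, []).append((b, doc))
        adj.setdefault(b, []).append((a, doc))
    return adj


def find_path(adj, start, goal):
    """Breadth-first search returning the shortest entity path and its source docs."""
    queue = deque([(start, [start], [])])
    seen = {start}
    while queue:
        node, path, docs = queue.popleft()
        if node == goal:
            return path, docs
        for neighbor, doc in adj.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [neighbor], docs + [doc]))
    return None, None


adjacency = build_adjacency(EDGES)
path, docs = find_path(adjacency, "John", "Mary")
```

Here `docs` names three different documents, which is the point: no single chunk contains the John-to-Mary connection, so top-k similarity search has nothing to return for it.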
HKUDS/LightRAG — hybrid graph and vector retrieval from a single unified index
- Before — the manual workflow: Teams running both semantic similarity and relationship traversal maintained two separate systems — a vector store and a graph database — each with its own ingestion pipeline, update cadence, and query interface.
```text
# Before: two separate systems for two retrieval modes
# System 1: embed chunks → vector store → similarity search
# System 2: extract entities → graph DB → traversal queries
# Two pipelines to maintain; two sets of stale data to manage
```
- After — with LightRAG: a single index supports vector similarity, graph traversal, and hybrid modes.
```python
# After: LightRAG, one index, four retrieval modes
import asyncio
from lightrag import LightRAG, QueryParam

async def main():
    # Model and embedding configuration depend on the LightRAG version in use
    rag = LightRAG(working_dir="./rag_cache")
    # ainsert expects document text, so read the file rather than passing a path
    with open("path/to/document.txt", encoding="utf-8") as f:
        await rag.ainsert(f.read())
    # Hybrid mode uses both vector similarity and graph traversal
    return await rag.aquery(
        "How does the new architecture affect the legacy system?",
        param=QueryParam(mode="hybrid"),
    )

result = asyncio.run(main())
```
- The productivity delta: According to the project README and arXiv paper (2410.05779), LightRAG supports four retrieval modes (naive, local, global, and hybrid) from a single unified index. The engineer no longer maintains separate systems for queries that require different retrieval strategies.
- How it works: LightRAG extracts a knowledge graph during ingestion, stores both graph edges and vector embeddings in a unified index, and routes each query to the appropriate retrieval mode. The paper was accepted at EMNLP 2025.
- Where it breaks: The quality of the knowledge graph depends on the LLM used during indexing. Low-quality or poorly-prompted models produce noisy graph extractions that degrade retrieval for graph-dependent query modes. The embedding and graph extraction are both LLM calls — compute costs scale with corpus size.
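To illustrate what "hybrid" retrieval buys over similarity alone, here is a deliberately small sketch: rank chunks by cosine similarity, then pull in chunks whose entities are graph neighbors of the top hit. The embeddings, entity links, and function names are all invented for illustration, and this is not LightRAG's actual retrieval algorithm.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical 2-d embeddings and extracted entities per chunk.
CHUNKS = {
    "c1": {"vec": [1.0, 0.0], "entities": {"new architecture"}},
    "c2": {"vec": [0.0, 1.0], "entities": {"legacy system"}},
    "c3": {"vec": [0.6, 0.8], "entities": {"billing"}},
}
# Hypothetical entity graph: the new architecture is linked to the legacy system.
GRAPH = {"new architecture": {"legacy system"}, "legacy system": {"new architecture"}}


def vector_rank(query_vec):
    """Chunk ids ordered by similarity to the query, best first."""
    return sorted(CHUNKS, key=lambda c: cosine(query_vec, CHUNKS[c]["vec"]), reverse=True)


def hybrid_retrieve(query_vec, top_k=1):
    """Top-k vector hits, followed by chunks reachable through the entity graph."""
    hits = vector_rank(query_vec)[:top_k]
    linked = set()
    for chunk_id in hits:
        for entity in CHUNKS[chunk_id]["entities"]:
            linked |= GRAPH.get(entity, set())
    expanded = [c for c in CHUNKS if c not in hits and CHUNKS[c]["entities"] & linked]
    return hits + expanded


query = [1.0, 0.1]
by_vector = vector_rank(query)
by_hybrid = hybrid_retrieve(query)
```

In this toy data the legacy-system chunk ranks last by similarity but is returned by the hybrid path because the graph links it to the top hit. It also shows the failure mode named above: a noisy graph would pull in irrelevant chunks the same way.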
getzep/graphiti — temporal knowledge graph for agent memory that handles facts that change over time
- Before — the manual workflow: AI agents maintained context via a truncated message history. Facts from earlier sessions were lost when the history was trimmed. Contradictions between old and new facts accumulated with no mechanism to resolve which was current.
```python
# Before: agent memory = message list, truncated at context limit
messages = []  # newest 20 messages; earlier facts are gone
# Session 1: "Project Alpha is in planning"
# Session 15: "Project Alpha shipped"
# Agent has no way to know which fact is currently true
```
- After — with graphiti: each interaction adds to a temporal knowledge graph that tracks which facts are currently valid.
```python
# After: graphiti maintains a temporal graph from agent episodes
import asyncio
from datetime import datetime, timezone
from graphiti_core import Graphiti

async def main():
    graphiti = Graphiti("bolt://localhost:7687", "neo4j", "password")
    await graphiti.add_episode(
        name="session_42",
        episode_body="Project Alpha shipped to production on January 15.",
        source_description="agent session notes",  # illustrative value
        reference_time=datetime.now(timezone.utc),
    )
    # Returns facts that are currently true; temporal contradictions resolved
    return await graphiti.search("What is the current status of Project Alpha?")

facts = asyncio.run(main())
```
- The productivity delta: According to the README, Graphiti’s context graphs “track how facts change over time, maintain provenance to source data, and support both prescribed and learned ontology — making them purpose-built for agents operating on evolving, real-world data.” The agent no longer loses information at session boundaries or accumulates unresolved contradictions.
- How it works: Graphiti extracts entities and relationships from each episode (agent interaction), stores them in a Neo4j graph, and marks temporal validity on each edge so queries return the currently-true state. The repo also includes an MCP server that lets Claude, Cursor, and other MCP-compatible clients use Graphiti as their memory backend.
- Where it breaks: Graphiti requires a running Neo4j instance (or a compatible managed graph database). Teams without an existing graph database add a new infrastructure dependency. The temporal resolution quality depends on LLM entity extraction during the `add_episode` step.
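The phrase "marks temporal validity on each edge" can be sketched without a graph database. The toy store below keeps every fact, closes out the previous value when a newer one arrives, and answers both "what is true now" and "what was true then". It is a simplified stand-in, not Graphiti's data model: the class and field names are hypothetical, the dates are made up, and it assumes facts arrive in chronological order.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    valid_from: datetime
    invalid_from: Optional[datetime] = None  # None means "still true"


class TemporalFacts:
    def __init__(self):
        self.facts: list[Fact] = []

    def add(self, subject, predicate, value, at):
        # A new value for the same (subject, predicate) closes out the old one
        # instead of deleting it, so history and provenance survive.
        for fact in self.facts:
            if (fact.subject, fact.predicate) == (subject, predicate) and fact.invalid_from is None:
                fact.invalid_from = at
        self.facts.append(Fact(subject, predicate, value, at))

    def current(self, subject, predicate):
        for fact in self.facts:
            if (fact.subject, fact.predicate) == (subject, predicate) and fact.invalid_from is None:
                return fact.value
        return None

    def as_of(self, subject, predicate, when):
        for fact in self.facts:
            if (fact.subject, fact.predicate) != (subject, predicate):
                continue
            if fact.valid_from <= when and (fact.invalid_from is None or when < fact.invalid_from):
                return fact.value
        return None


memory = TemporalFacts()
memory.add("Project Alpha", "status", "planning", datetime(2024, 1, 5))   # made-up date
memory.add("Project Alpha", "status", "shipped", datetime(2024, 6, 15))   # made-up date

status_now = memory.current("Project Alpha", "status")
status_in_march = memory.as_of("Project Alpha", "status", datetime(2024, 3, 1))
```

The hard part in a real system is the step this sketch skips: deciding from free text that two statements are about the same subject and predicate, which is where the LLM extraction quality noted above matters.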
Theme 3: Databases Gained a Native AI Interface
At the start of 2024, connecting a database to an LLM required writing a custom connector: one integration for Claude, another for Gemini, another for each new model. Three repositories removed that per-pairing work in 2024, each targeting a different layer of the database-to-AI interface.
googleapis/mcp-toolbox — one MCP server connecting any AI agent to any database
- Before — the manual workflow: Each AI assistant required its own database integration. Adding a new model meant writing and maintaining a new connector in that model’s tool-calling format.
```python
# Before: same database logic registered separately for each LLM
# For Claude: tool defined in Anthropic tool-use format
# For Gemini: same logic, different SDK, different schema format
# For new model: write it again
import os
import psycopg2

DATABASE_URL = os.environ["DATABASE_URL"]

def search_products(name: str) -> list:
    conn = psycopg2.connect(DATABASE_URL)
    cursor = conn.cursor()  # the cursor must be created before executing
    cursor.execute("SELECT * FROM products WHERE name ILIKE %s", (f"%{name}%",))
    return cursor.fetchall()
```
- After — with mcp-toolbox: define tools once in YAML; any MCP-compatible client connects.
```yaml
# After: toolbox_config.yaml, write once, connect from any MCP client
sources:
  products-db:
    kind: postgres
    host: ${DB_HOST}
    database: products
tools:
  search-products:
    kind: postgres-sql
    source: products-db
    description: "Search products by name"
    parameters:
      - name: query
        type: string
        description: "Product name search term"
    statement: SELECT id, name, price FROM products WHERE name ILIKE $1
```
```bash
toolbox serve --tools-file toolbox_config.yaml
# Claude Code, Gemini CLI, and other MCP clients all connect; no per-client code
```
- The productivity delta: According to the README, mcp-toolbox “serves a dual purpose: a ready-to-use MCP server that instantly connects AI clients to databases, and a robust framework to build specialized AI tools for production agents.” The tool definition is written once and serves all connected clients.
- How it works: The server implements the Model Context Protocol and exposes database-backed tools via a standardized interface. Supported databases per the README topics and description include BigQuery, Spanner, PostgreSQL, MySQL, Redis, Firestore, MongoDB, Elasticsearch, Oracle, ClickHouse, CockroachDB, and TiDB.
- Where it breaks: The README notes that custom tools require careful parameterization to prevent SQL injection — the framework does not automatically sanitize inputs. Every tool definition needs a security review before it is exposed to a production agent.
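The parameterization concern is general to any tool that turns agent-supplied text into SQL. The sketch below uses Python's built-in `sqlite3` rather than mcp-toolbox, so the table, data, and function names are assumptions for illustration; it contrasts string interpolation with a bound parameter (the `?` placeholder plays the role that `$1` plays in a PostgreSQL statement).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    INSERT INTO products (name, price) VALUES ('widget', 9.5), ('gadget', 20.0);
""")


def search_unsafe(term: str):
    # Anti-pattern: caller text is spliced into the SQL string and parsed as SQL.
    sql = f"SELECT name FROM products WHERE name LIKE '%{term}%'"
    return [row[0] for row in conn.execute(sql)]


def search_safe(term: str):
    # The driver binds the value; it is treated as data, never as SQL.
    sql = "SELECT name FROM products WHERE name LIKE ?"
    return [row[0] for row in conn.execute(sql, (f"%{term}%",))]


# An input an agent could be tricked into passing along.
malicious = "zzz' OR 1=1 --"
leaked = search_unsafe(malicious)   # the injected OR clause matches every row
blocked = search_safe(malicious)    # the same text matches nothing
normal = search_safe("wid")
```

A security review of a tool definition largely comes down to confirming that every caller-controlled value reaches the database through a placeholder like this and never through string construction.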
Canner/WrenAI — semantic context layer that teaches AI agents what business data means
- Before — the manual workflow: NL2SQL prompts included raw schema dumps — table names, column names — and relied on the LLM to infer business meaning. Queries crossing multiple tables or depending on business-specific definitions (revenue = net amount after refunds) produced plausible but wrong SQL.
```sql
-- Before: LLM infers semantics from raw schema; gets the shape right, the logic wrong
-- Context given: "orders(id, customer_id, amount, refund_amount, created_at)"
-- Question: "Who are our top customers by revenue this quarter?"
-- LLM output:
SELECT customer_id, SUM(amount) FROM orders GROUP BY 1 ORDER BY 2 DESC
-- Wrong: uses gross amount; no customer name join; no quarter filter
```
- After — with WrenAI: the semantic model defines what data means; agents query through the context layer.
```bash
# After: WrenAI semantic context layer
pip install wrenai
# Semantic model defines: revenue = amount - refund_amount; customer name from customers table
wren ask "Who are our top 10 customers by net revenue this quarter?"
# WrenAI resolves semantics, generates correct SQL, returns verified results
```
- The productivity delta: According to the README, WrenAI is “the open context layer for AI agents over business data — your agent doesn’t know what your data means. We fix that.” The semantic layer prevents the class of wrong-but-plausible SQL that schema-only prompting produces.
- How it works: WrenAI maintains a semantic layer (MDL — Modeling Definition Language) that maps business concepts to the underlying schema. AI agents query through this layer rather than against raw tables, and the engine translates natural language into semantically-grounded SQL.
- Where it breaks: The semantic model requires manual maintenance when the underlying schema changes. If a column is renamed or a business definition shifts, the MDL needs to be updated separately — it does not automatically sync from schema migrations.
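Drift between a semantic model and the schema can be caught mechanically. The sketch below is a hypothetical check, not WrenAI's MDL format or tooling: the `SEMANTIC_MODEL` dictionary is an invented stand-in for business definitions, and `sqlite3` stands in for the warehouse. It reports which definitions reference columns that no longer exist after a migration.

```python
import sqlite3

# Hypothetical semantic model: business term -> (table, columns it depends on, SQL expression).
SEMANTIC_MODEL = {
    "net_revenue": ("orders", ["amount", "refund_amount"], "SUM(amount - refund_amount)"),
}


def table_columns(conn, table):
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}


def find_drift(conn, model):
    """Return {metric: [missing columns]} for definitions the schema no longer satisfies."""
    drift = {}
    for metric, (table, columns, _expr) in model.items():
        existing = table_columns(conn, table)
        missing = [c for c in columns if c not in existing]
        if missing:
            drift[metric] = missing
    return drift


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL, refund_amount REAL)")
before_migration = find_drift(conn, SEMANTIC_MODEL)

# Simulate a migration that renames refund_amount to refunded.
conn.execute("DROP TABLE orders")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL, refunded REAL)")
after_migration = find_drift(conn, SEMANTIC_MODEL)
```

Running a check like this in the migration pipeline turns "remember to update the semantic layer" into a failing step instead of a silent wrong answer. It catches renamed or dropped columns only; a changed business definition still needs human review.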
timescale/pgai — automatic vector embeddings and semantic search inside PostgreSQL
- Before — the manual workflow: AI applications maintained an external embedding pipeline — call the embedding API on new or updated rows, push embeddings to a separate vector store, handle synchronization failures, manage stale embeddings when source data changed.
```python
# Before: external embedding pipeline decoupled from source data
def sync_embeddings():
    rows = db.execute(
        "SELECT id, text FROM docs WHERE updated_at > %s", (last_sync,)
    )
    for row in rows:
        embedding = openai.embeddings.create(
            input=row.text, model="text-embedding-3-small"
        )
        vector_store.upsert(row.id, embedding.data[0].embedding)
# Runs on a cron; stale embeddings accumulate during API outages
```
- After — with pgai: the vectorizer runs inside PostgreSQL, triggered automatically by data changes.
```python
# After: pgai vectorizer keeps embeddings synchronized inside the database
import pgai

vectorizer = pgai.create_vectorizer(
    "docs",
    destination="docs_embeddings",
    embedding=pgai.openai_embedding("text-embedding-3-small", 1536),
    chunking=pgai.character_text_splitter(chunk_size=800),
)
# pgai workers re-embed automatically when docs data changes
# Query with standard SQL + pgvector; no separate vector store to operate
```
- The productivity delta: According to the README, pgai “automatically creates and synchronizes vector embeddings from PostgreSQL data and S3 documents” with “embeddings [that] update automatically as data changes.” The external sync cron and its stale-embedding handling are eliminated.
- How it works: pgai installs as a Python package with database components. Stateless vectorizer workers watch for data changes via the configuration, process a queue, and write embeddings back to PostgreSQL. The README notes the architecture “decouples data modifications from the embedding process so failures in the embedding service do not affect core data operations.” It works with any PostgreSQL deployment, including RDS, Supabase, and Timescale Cloud (all cited in the README).
- Where it breaks: pgai requires deploying and operating vectorizer worker processes alongside the database. For managed PostgreSQL deployments, the worker is an additional compute process with its own health monitoring. The decoupling means a worker outage stops embedding updates without affecting read/write on the underlying data — correct behavior, but the queue lag needs independent observability.
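The decoupling and the need for independent lag monitoring can be modeled in a few lines. This is a toy in-memory simulation with hypothetical names, not pgai's worker or its queue tables: writes always succeed and enqueue work, a separate worker drains the queue when healthy, and an alert fires when the oldest pending item is older than an assumed staleness budget.

```python
from collections import deque


class EmbeddingQueue:
    """Toy model: data writes enqueue work; a separate worker drains it."""

    def __init__(self):
        self.pending = deque()  # (row_id, enqueued_at) awaiting embedding
        self.embedded = {}      # row_id -> time the embedding was written

    def on_write(self, row_id, now):
        # The data write itself is never blocked; only the embedding is deferred.
        self.pending.append((row_id, now))

    def worker_tick(self, now, worker_healthy=True, batch=10):
        """Process up to `batch` items; a crashed worker processes none."""
        if not worker_healthy:
            return 0
        done = 0
        while self.pending and done < batch:
            row_id, _ = self.pending.popleft()
            self.embedded[row_id] = now
            done += 1
        return done

    def lag_seconds(self, now):
        """Age of the oldest unprocessed item, or 0 when the queue is empty."""
        return now - self.pending[0][1] if self.pending else 0


def should_alert(queue, now, max_staleness_seconds):
    return queue.lag_seconds(now) > max_staleness_seconds


q = EmbeddingQueue()
q.on_write("doc-1", now=0)
q.on_write("doc-2", now=30)
q.worker_tick(now=60, worker_healthy=False)   # outage: writes landed, nothing drained
alert_during_outage = should_alert(q, now=400, max_staleness_seconds=300)
q.worker_tick(now=410)                        # worker recovers and catches up
alert_after_recovery = should_alert(q, now=420, max_staleness_seconds=300)
```

The staleness budget (300 seconds here) is an arbitrary assumption; the right value depends on how stale a search result the application can tolerate.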
Year-over-Year Signal
| Domain | Manual task at year start | Status at year end | What drove the change |
|---|---|---|---|
| System design — web | Per-site Playwright automation for web tasks | Replaced for irregular tasks by browser-use; scripted automation still cost-effective for deterministic high-frequency flows | browser-use shipped Oct 2024; LLM vision quality crossed a usability threshold |
| System design — AI connectors | Custom per-LLM per-database connector code | Partially standardized via MCP; mcp-toolbox unifies 11+ databases under one server definition | Model Context Protocol gained cross-vendor adoption in 2024 |
| System design — RAG | Flat vector search as the default retrieval mechanism | Graph-augmented retrieval available via graphrag and LightRAG; production adoption still early for most teams | graphrag shipped Mar 2024, LightRAG Oct 2024; peer-reviewed research backed both |
| Databases | External embedding pipeline with manual sync | Automated for PostgreSQL stacks by pgai vectorizer | pgai shipped May 2024 with synchronization as a first-class design goal |
| Databases — NL2SQL | Schema-dump prompting for text-to-SQL | Semantic layer approach available via WrenAI; eliminates the class of wrong-but-plausible SQL from schema inference | WrenAI’s MDL provides business-concept grounding that raw schema prompting cannot |
| Infrastructure | Redis as the community default distributed cache | Valkey (25,887 stars) forked and became an LF project; migration from Redis ongoing across the ecosystem | Redis changed its license to SSPL and RSALv2 in March 2024 |
In Practice
- Theme 1 — Agents as Operators: firecrawl’s P95 latency figure (3.4s), proxy handling description, and 96% web coverage are stated in the README. OpenHands’ 77.6% SWE-Bench score appears in the README badge with a link to the benchmark spreadsheet. Browser-use’s LLM-driven navigation model is described in the quickstart. I have not run OpenHands on a production codebase; the SWE-Bench score measures autonomous issue resolution on a curated benchmark, not arbitrary production work. It is a capability signal, not a deployment guarantee.
- Theme 2 — RAG with Graph: GraphRAG’s entity extraction and query modes are described in the README and arXiv 2404.16130. LightRAG’s four retrieval modes are in the README and arXiv 2410.05779 (EMNLP 2025 accepted). Graphiti’s temporal graph, provenance tracking, and MCP server are described in the README. I have not verified graph extraction quality at production corpus sizes; the warning about indexing cost in graphrag’s README reflects a real, documented constraint.
- Theme 3 — Databases Go AI-Native: mcp-toolbox’s supported database list (11+) is in the GitHub topics and README. pgai’s vectorizer architecture is described in the README including the architecture diagram and the decoupling design rationale. WrenAI’s semantic layer approach is described in the README tagline and documentation links. I have not run any of these three in production; pgai requires self-managed vectorizer workers that add operational overhead not visible in the quickstart.
Productivity Scorecard
| Tool | Theme | Domain | Eliminated Task | Documented Impact | Maturity |
|---|---|---|---|---|---|
| firecrawl/firecrawl | Agents as Operators | System Design | Per-site scraping pipeline | “Handles rotating proxies, rate limits, JS-blocked content — zero configuration” (README) | GA |
| browser-use/browser-use | Agents as Operators | System Design | Per-site Playwright automation | “Makes websites accessible for AI agents” (README); hosted cloud available | GA |
| OpenHands/OpenHands | Agents as Operators | Developer Productivity | Write-test-debug loop | 77.6% SWE-Bench score (README badge; spreadsheet linked) | GA |
| microsoft/graphrag | RAG with Graph | System Design | Multi-hop RAG via flat vector search | “Unlocks LLM discovery on narrative private data” (MS Research blog, linked in README) | GA |
| HKUDS/LightRAG | RAG with Graph | System Design | Separate vector and graph indexes | 4 unified retrieval modes; EMNLP 2025 paper (arXiv 2410.05779) | GA |
| getzep/graphiti | RAG with Graph | System Design | Truncated message-list agent memory | “Tracks how facts change over time, maintains provenance” (README) | GA |
| googleapis/mcp-toolbox | Databases Go AI-Native | Databases | Per-LLM per-database connector code | “Instantly connect AI clients to 11+ databases” (README); Apache 2.0 | GA |
| Canner/WrenAI | Databases Go AI-Native | Databases | Schema-dump NL2SQL prompting | “Agent doesn’t know what data means. We fix that.” (README); Apache 2.0 | GA |
| timescale/pgai | Databases Go AI-Native | Databases | External embedding sync pipeline | “Automatically creates and synchronizes vector embeddings as data changes” (README) | GA |
Where It Breaks
| Failure mode | Trigger | Fix |
|---|---|---|
| graphrag indexing cost exceeds budget | LLM extraction runs against a large corpus without cost controls | Per the README: “start small.” Set per-run token budgets; test on a 50-document subset before indexing the full corpus |
| browser-use agent slower than scripted automation | High-frequency, deterministic web workflow running thousands of times per day | Use Playwright for predictable, high-volume flows; reserve browser-use for irregular or layout-change-prone tasks |
| firecrawl self-hosted proxy pool requires maintenance | Team self-hosts to avoid API rate limits and per-page costs | Evaluate hosted-service pricing vs. proxy infrastructure ops; the hosted tier removes the maintenance burden the README describes as “handled” |
| WrenAI semantic layer drifts after schema migration | Column renamed or table structure changed outside WrenAI’s MDL | Treat schema changes as requiring a semantic layer update; add MDL review to the migration checklist |
| pgai vectorizer worker outage causes embedding queue lag | Embedding API outage or worker process crash | Per README design: data writes are unaffected. Monitor vectorizer queue depth independently; alert when lag exceeds acceptable staleness for the use case |
| OpenHands agent generates correct but unconventional code | Agent produces code that passes tests but violates team conventions | Require human PR review before merge; use the SDK to constrain file scope available to the agent |
| LightRAG graph quality degrades on noisy input | Low-quality LLM used for indexing, or poorly structured input documents | Use the highest-quality available model for indexing (separate from the query model); re-index if retrieval quality drops |
| mcp-toolbox write-capable tool exposed to production agent | Custom tool allows INSERT or UPDATE without row-level restrictions | Restrict all production mcp-toolbox tools to read-only SQL; implement an explicit approval workflow before any write-capable tool is connected to a live agent |
| OpenHands coding agent + mcp-toolbox write access — agent runs DDL against production database | Agent generates schema-altering SQL via a write-capable mcp-toolbox tool | Scope mcp-toolbox to read-only connections; run OpenHands in sandbox environments isolated from production database write paths |
What to Carry into 2025
- Problem: The operator layer arrived in 2024 — agents can now act on websites, codebases, and databases — but agent memory and long-term context management remain fragile. Graphiti and graphrag solve parts of the problem, but production-grade multi-session agent memory with reliable temporal reasoning is not yet a solved category. The gap going into 2025 is persistent agent state at production scale.
- Solution: Three tools to evaluate now, one per domain, each GA with documented production readiness: `browser-use` for web-operating agents where site-specific scripting is the bottleneck (system design), `pgai` for teams maintaining an external embedding cron that drifts from source data (databases), and `mcp-toolbox` for teams that have written the same database connector more than twice across different AI integrations (databases and platform).
- Proof: After 60 days on pgai, the embedding sync cron job should be gone. The vectorizer queue lag metric (observable in the tables pgai creates in PostgreSQL) replaces the custom pipeline monitor. If the cron still runs in parallel, the migration is incomplete and the team is operating two sources of truth for embeddings.
- Action: Install pgai with `pip install pgai`, run `pgai install` against a development PostgreSQL instance, and create one vectorizer over the table you currently embed externally. Run both pipelines in parallel for two weeks and compare the embedding freshness and error rates. The first place they diverge will show exactly what the external pipeline was doing wrong, and whether pgai’s architecture handles it correctly for your workload.