About KARGA Markets
A technical deep-dive into our AI-powered natural language interface to a multi-source financial graph
The Problem
Traditional financial data platforms store information in isolated silos. Stock prices live in one database, government contracts in another, SEC filings in a third, and prediction markets somewhere else entirely.
To answer a question like "Which S&P 500 companies with significant government contracts are mentioned in prediction markets?", you'd need to manually query multiple systems, export data, and perform complex joins in spreadsheets.
There had to be a better way.
Why Knowledge Graphs?
1. Knowledge Graphs
Data stored as interconnected nodes and relationships using ArangoDB, enabling complex multi-hop queries across disparate data sources in milliseconds.
2. Natural Language to Graph Query
GPT-4 turns your question into precise AQL and runs it against ArangoDB. You get structured data from the graph—not retrieved passages. The model then synthesizes insights and follow-ups from those query results (and optional web search).
3. Semantic Search
Vector embeddings enable concept-based search - find "cybersecurity contracts" even when documents use terms like "network security" or "threat detection."
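The idea behind concept-based search can be illustrated with a minimal sketch: rank documents by cosine similarity between embedding vectors instead of matching keywords. The toy 3-dimensional vectors below are stand-ins for the platform's real 300-dimensional Doc2Vec embeddings.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dim "embeddings"; the real ones are 300-dim Doc2Vec vectors.
docs = {
    "network security contract": [0.9, 0.1, 0.0],
    "threat detection services": [0.8, 0.2, 0.1],
    "wheat futures delivery":    [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # pretend embedding of "cybersecurity contracts"

ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked[0])  # a security-related doc ranks first despite zero keyword overlap
```

Both security documents score near 1.0 against the query while the wheat document scores near 0, which is exactly why "cybersecurity contracts" finds "network security" text.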
About KARGA
KARGA Markets is an AI-powered financial intelligence platform that transforms how analysts interact with multi-source data. Built on a foundation of graph database technology and large language models, KARGA enables natural language queries across traditionally siloed datasets including stock prices, government contracts, SEC filings, prediction markets, commodity futures, and energy data. Questions are translated into graph queries (AQL), executed, then summarized—no document retrieval step.
Mission
Democratize access to complex financial analysis by making cross-domain data exploration as simple as asking a question. No SQL required, no manual data joins, no isolated silos - just intelligence.
Technology
ArangoDB multi-model graph database, OpenAI GPT-4 for query generation and analysis, Next.js 14 frontend, FastAPI backend, Doc2Vec embeddings for semantic search, and automated data pipelines for daily updates.
Recent Enhancements
We're in beta and improving with your feedback.
Send feedback: karga.analytics@gmail.com
Graph Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ USER INTERFACE │
│ Next.js 14 + React + Framer Motion + Tailwind CSS │
│ • Natural Language Query Input │
│ • Interactive Graph Visualization (ReactFlow) │
│ • Real-time Market Cards │
│ • Data Tables with Filtering/Sorting │
└─────────────────────┬───────────────────────────────────────────────┘
│ HTTPS / REST API
┌─────────────────────▼───────────────────────────────────────────────┐
│ FASTAPI BACKEND │
│ Python 3.13 + FastAPI + Pydantic │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Query Pipeline (Parallel Execution) │ │
│ │ ┌────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ GPT-4 │ │ Perplexity │ │ ArangoDB │ │ │
│ │ │ Intent │ │ Web Search │ │ Graph Query │ │ │
│ │ │ Detection │ │ (Current │ │ (Historical │ │ │
│ │ │ │ │ Events) │ │ Data) │ │ │
│ │ └────┬───────┘ └──────┬───────┘ └──────┬───────┘ │ │
│ │ │ │ │ │ │
│ │ └─────────────────┴───────────────────┘ │ │
│ │ │ │ │
│ │ ┌──────▼──────────┐ │ │
│ │ │ GPT-4 Synthesis │ │ │
│ │ │ Combines Results│ │ │
│ │ └──────────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Security: Rate Limiting • Input Validation • HSTS Headers │
└─────────────────────┬───────────────────────────────────────────────┘
│ AQL Queries
┌─────────────────────▼───────────────────────────────────────────────┐
│ ARANGODB CLOUD │
│ Multi-Model Database (Document + Graph + Search) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Document │ │ Graph │ │ Vector │ │
│ │ Collections │ │ Edges │ │ Embeddings │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ • 612 companies (S&P 500) │
│ • 2M+ daily market data points (OHLCV + 40 indicators) │
│ • 100K+ government contract awards (with embeddings) │
│ • 7.5K SEC filings + 4.36M sentences (Doc2Vec embeddings) │
│ • 20K+ prediction markets (Polymarket + Kalshi) │
│ • 64K+ futures prices (CME commodities) │
│ • CFTC commodity positions + EIA energy data │
│ • Daily options flow (612 tickers) │
└───────────────────────────────────────────────────────────────────────┘
Knowledge Graph Structure
Data isn't just stored—it's connected. Here's how relationships enable powerful queries:
Company → Market Data
HAS_MARKETDATA edges connect companies to their daily stock prices, enabling queries like "Show me tech companies with SMA_50 > SMA_200 (golden cross)"
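The golden-cross condition mentioned above reduces to comparing two simple moving averages. A minimal sketch in pure Python (toy price series, not real market data):

```python
def sma(prices: list[float], window: int) -> float:
    """Simple moving average over the last `window` prices."""
    return sum(prices[-window:]) / window

def golden_cross(prices: list[float], short: int = 50, long: int = 200) -> bool:
    """True when the short SMA sits above the long SMA (bullish signal)."""
    if len(prices) < long:
        return False  # not enough history to compute the long SMA
    return sma(prices, short) > sma(prices, long)

# Toy series: flat at 100 for 150 days, then a steady rally,
# so the 50-day SMA pulls ahead of the 200-day SMA.
prices = [100.0] * 150 + [100.0 + i for i in range(1, 51)]
print(golden_cross(prices))  # True
```

In production the platform stores precomputed SMA_50 and SMA_200 on the market-data documents, so the query only needs a FILTER comparison rather than recomputing the averages.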
Company → Government Awards
HAS_AWARD edges link companies to contracts, enabling semantic searches: "Defense companies with cybersecurity contracts over $10M"
Market → Company
market_mentions_company edges connect prediction markets to mentioned tickers: "Tesla reaches $300" → TSLA
Company → SEC Filings → Sentences
HAS_FILING → has_section → has_sentence. Multi-hop traversal for sentiment analysis: "Show negative FinBERT sentences from recent Apple 10-Ks"
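In AQL this three-hop path can be expressed as a single anonymous traversal over the three edge collections. A hedged sketch (the `finbert_score` and `text` field names are assumptions for illustration; the real document schema isn't shown here):

```python
def negative_sentences_aql(threshold: float = -0.5) -> str:
    """Build a 3-hop AQL traversal: company -> filing -> section -> sentence.

    Edge collection names come from the graph structure above;
    the `finbert_score` field name is an assumption.
    """
    return f"""
    FOR sentence IN 3..3 OUTBOUND @company_id
        HAS_FILING, has_section, has_sentence
        FILTER sentence.finbert_score <= {threshold}
        SORT sentence.finbert_score ASC
        LIMIT 20
        RETURN sentence.text
    """

query = negative_sentences_aql()
print(query)
```

Bound with `bind_vars={"company_id": "Company/AAPL"}`, this would return the most negative sentences from Apple's filings in one round trip.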
Company → Commodity Positions
HAS_COMMODITY_POSITION links companies to CFTC data for commodity exposure analysis
Company → Options Flow
COMPANY_HAS_OPTIONS connects companies to daily options activity. OPTIONS_BEFORE_AWARD flags unusual activity before contract awards
CFTC → Futures → EIA Data
POSITION_ON_COMMODITY, INVENTORY_AFFECTS_PRICE, and STORAGE_AFFECTS_PRICE enable supply/demand correlation analysis
SEC Sentences (Semantic Search)
Multi-hop traversal HAS_FILING → has_section → has_sentence with Doc2Vec embeddings: "Find sentences discussing supply chain risks in energy sector 10-Ks"
Company → Commodities (Direct)
COMPANY_TRADES_COMMODITY links companies to futures_prices for commodity exposure (e.g. XOM → crude oil). Enables "energy stocks vs crude prices" without CFTC hop.
SEC Exhibits & XBRL
has_exhibit (sec_filings → sec_exhibits) and has_xbrl_data (sec_filings → sec_xbrl_data) for structured financials and exhibit-level search.
Document Collections (21)
Vertex collections in the graph. Run introspect_schema.py for live counts.
Edge Collections (22)
Relationship edges for graph traversal.
Example Multi-Hop Query
"Find energy companies with government contracts mentioning 'renewable' that are mentioned in prediction markets with volume > $50k"
→ Traverses Company → Awards (semantic search) → Markets (graph join) in milliseconds
Data Universe
Stock Market Data
- Source: Yahoo Finance API
- Coverage: All S&P 500 companies
- Data: OHLCV, volume, market cap, P/E ratios, technical indicators (SMA, EMA, MACD, RSI, Bollinger Bands)
- Update Frequency: Daily via Airflow DAG
Government Contracts
- Source: USASpending.gov API
- Coverage: Federal contract awards to public companies
- Data: Award amounts, agencies, descriptions, dates
- Special Feature: Vector embeddings for semantic search (find "AI contracts" without an exact keyword match)
Prediction Markets
- Sources: Polymarket API, Kalshi API
- Coverage: Politics, economics, sports, entertainment
- Data: Probabilities, volumes, liquidity, traders
- Connection: Graph edges link markets to mentioned companies (e.g., "Will Tesla reach $300?" → Tesla stock)
SEC Filings
- Source: SEC EDGAR API
- Types: 10-K (annual), 10-Q (quarterly), 8-K (events)
- Processing: 7.5K filings parsed into 4.36M sentences
- Sentiment: FinBERT scores for each sentence (-1 to +1)
- Embeddings: Doc2Vec 300-dim financial-domain embeddings for semantic search
Economic Indicators
- Source: Federal Reserve Economic Data (FRED)
- Data: S&P 500 index, Fed funds rate, unemployment, GDP, yield curves
- Coverage: Historical time series data
Commodity Positions
- Source: CFTC Commitments of Traders Report
- Data: Long/short positions by trader type (commercial, non-commercial, retail)
- Commodities: Oil, gold, wheat, corn, natural gas, etc.
Options Flow
- Source: Options data via yfinance
- Coverage: All S&P 500 companies (612 tickers)
- Data: Call/put volumes, open interest, implied volatility, put/call ratios
- Detection: Unusual-activity flags for potential insider trading signals
Futures Prices
- Source: CME Group via yfinance
- Records: 64,000+ historical prices
- Commodities: Crude oil, natural gas, gold, silver, copper, corn, wheat, soybeans
- Indicators: OHLCV data plus technical indicators (RSI, MACD, SMA)
EIA Energy Data
- Source: U.S. Energy Information Administration API
- Data: Crude oil inventory, natural gas storage, production, LNG exports
- Frequency: Weekly and monthly updates
- Connection: Linked to futures prices for supply/demand analysis
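The supply/demand linkage can be made concrete with a correlation check between inventory builds and prices. A sketch with invented toy numbers (not real EIA or CME data), using a plain Pearson coefficient:

```python
import math
import statistics

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy weekly figures: inventory builds tend to coincide with softer prices.
crude_inventory = [430.0, 435.0, 441.0, 439.0, 447.0, 452.0]  # million barrels
wti_price       = [78.2, 77.5, 75.9, 76.4, 74.1, 72.8]        # $/bbl

r = pearson(crude_inventory, wti_price)
print(round(r, 2))  # strongly negative correlation
```

In the graph, the INVENTORY_AFFECTS_PRICE and STORAGE_AFFECTS_PRICE edges let this kind of join happen inside a single AQL query instead of a spreadsheet.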
AI Query Pipeline
When you ask a question, query generation follows a two-step flow: natural language → structured JSON plan → deterministic AQL. No second LLM call is needed to produce the query itself.
Step 1: Intent Detection
GPT-4 classifies your query: Is it about a specific ticker (AAPL, MSFT) or a concept (AI, cybersecurity)? This determines whether to use exact matching or semantic search.
Intent: concept_query
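To make the routing decision concrete, here is a toy heuristic standing in for the GPT-4 classifier (the real system prompts the model; this regex version is purely illustrative, and the ticker list is an invented subset):

```python
import re

KNOWN_TICKERS = {"AAPL", "MSFT", "TSLA", "XOM"}  # illustrative subset only

def detect_intent(question: str) -> str:
    """Toy stand-in for the GPT-4 classifier: route to exact-match
    ticker queries or embedding-based concept search."""
    tokens = re.findall(r"\b[A-Z]{1,5}\b", question)
    if any(t in KNOWN_TICKERS for t in tokens):
        return "ticker_query"   # exact matching on the Company collection
    return "concept_query"      # semantic search over embeddings

print(detect_intent("Show AAPL golden crosses this year"))  # ticker_query
print(detect_intent("cybersecurity contracts over $10M"))   # concept_query
```

The outcome of this step decides whether the plan in Step 2a filters on an exact ticker or on a COSINE_SIMILARITY threshold against a query embedding.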
Step 2a: JSON Query Plan (LLM)
GPT-4 receives the full database schema (collections, fields, edges) and outputs a structured JSON plan: primary collection, traversals (from/to/edge), filters, sort, limit, and bind_vars. No raw AQL yet—just a machine-readable plan.
{
"intent": "companies with semantic match on awards",
"primary_collection": "Award",
"traversals": [{"from_collection": "Award", "to_collection": "Company", "edge_collection": "HAS_AWARD"}],
"filters": {"Award.description_embedding": "cosine_similarity @query_vector >= 0.75"},
"bind_vars": {"query_vector": "[...]"}
}
Step 2b: JSON → AQL (Deterministic)
A deterministic converter (json_to_aql) turns the JSON plan into executable AQL: FOR loops, edge traversals, FILTERs, SORT, LIMIT, and RETURN. No second LLM—consistent, auditable query generation.
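A minimal sketch of such a converter follows. The actual json_to_aql implementation isn't shown here, so the plan shape is simplified (filters as a list of AQL expressions rather than the keyed dict above), but the principle is the same: string assembly only, no LLM in the loop.

```python
def json_to_aql(plan: dict) -> str:
    """Minimal deterministic JSON-plan -> AQL converter (simplified sketch)."""
    primary = plan["primary_collection"]
    var = primary.lower()
    lines = [f"FOR {var} IN {primary}"]
    for flt in plan.get("filters", []):
        lines.append(f"  FILTER {flt}")
    for t in plan.get("traversals", []):
        to_var = t["to_collection"].lower()
        edge_var = f"e_{to_var}"
        lines.append(f"  FOR {edge_var} IN {t['edge_collection']}")
        lines.append(f"    FILTER {edge_var}._from == {var}._id")
        lines.append(f"    FOR {to_var} IN {t['to_collection']}")
        lines.append(f"      FILTER {to_var}._id == {edge_var}._to")
    if "limit" in plan:
        lines.append(f"  LIMIT {plan['limit']}")
    returns = ", ".join([var] + [t["to_collection"].lower()
                                 for t in plan.get("traversals", [])])
    lines.append(f"  RETURN {{{returns}}}")
    return "\n".join(lines)

plan = {
    "primary_collection": "Award",
    "filters": ["COSINE_SIMILARITY(award.description_embedding, @query_vector) >= 0.75"],
    "traversals": [{"from_collection": "Award", "to_collection": "Company",
                    "edge_collection": "HAS_AWARD"}],
}
print(json_to_aql(plan))
```

Because the conversion is a pure function of the plan, the same plan always yields the same AQL, which keeps query generation auditable.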
FOR award IN Award
  FILTER COSINE_SIMILARITY(award.description_embedding, @query_vector) >= 0.75
  FOR edge_company IN HAS_AWARD
    FILTER edge_company._from == award._id
    FOR company IN Company
      FILTER company._id == edge_company._to
      RETURN {company, award}
Step 3: Parallel Execution
Two paths run in parallel when needed:
- Database Query: AQL executes against ArangoDB (historical data)
- Web Search: Perplexity searches for current events (real-time context)
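This fan-out can be sketched with asyncio.gather. The two coroutines below are stubs standing in for the real ArangoDB and Perplexity calls (the production code's function names and drivers aren't shown here):

```python
import asyncio

async def run_aql(query: str) -> list[dict]:
    """Stub for the ArangoDB query path; real code would use a DB driver."""
    await asyncio.sleep(0.01)  # simulated network latency
    return [{"company": "XOM", "award_total": 12_500_000}]

async def web_search(question: str) -> str:
    """Stub for the Perplexity current-events search."""
    await asyncio.sleep(0.01)
    return "Recent headlines about energy contracts..."

async def answer(question: str, aql: str) -> dict:
    # Both paths start immediately; total wait is the slower of the two,
    # not the sum of both.
    db_rows, web_context = await asyncio.gather(run_aql(aql), web_search(question))
    return {"db": db_rows, "web": web_context}

result = asyncio.run(answer("energy contracts?", "FOR a IN Award RETURN a"))
print(result["db"][0]["company"])  # XOM
```

Running the paths concurrently means the user-facing latency is bounded by the slower source rather than both combined.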
Step 4: Synthesis & Analysis
GPT-4 combines database results with optional web context, analyzes patterns, and generates:
- Structured Analysis: Formatted results with key metrics
- Insights: Trends, correlations, anomalies
- Follow-up Questions: Suggested deeper dives
Technology Stack
Frontend
- Framework: Next.js 14 (App Router)
- UI: React 18, TypeScript
- Styling: Tailwind CSS
- Animations: Framer Motion
- Graph Viz: ReactFlow
- Hosting: Vercel
Backend
- Framework: FastAPI (Python 3.13)
- Validation: Pydantic
- Security: SlowAPI rate limiting
- LLM: OpenAI GPT-4
- Web Search: Perplexity AI
- Hosting: Railway
Database
- Platform: ArangoDB Cloud
- Type: Multi-model (Document + Graph)
- Query Language: AQL
- Embeddings: OpenAI text-embedding-3-small
- Size: ~5GB (2M+ documents)
- Location: Germany (GDPR compliant)
Data Pipeline
- Orchestration: Apache Airflow
- Processing: Python, Pandas, NumPy
- Sentiment: FinBERT
- Schedule: Daily updates at 2 AM UTC
- Monitoring: Airflow UI + logs
Performance & Scale
Performance Optimizations
- ✓ Persistent indexes on ticker, date, and volume fields
- ✓ Skip-list indexes for range queries
- ✓ Edge collections for O(1) relationship lookups
- ✓ Query result caching (5-minute TTL)
- ✓ Parallel DB + web search execution
- ✓ Streaming results with batch_size=1000
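The 5-minute TTL cache might look like the minimal sketch below (the production implementation isn't shown; this illustrates the expiry logic with lazy eviction):

```python
import time

class TTLCache:
    """Minimal query-result cache with per-entry expiry."""

    def __init__(self, ttl_seconds: float = 300.0):  # 300s = the 5-minute TTL
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # lazily evict stale entries on read
            return None
        return value

    def set(self, key: str, value) -> None:
        self._store[key] = (time.monotonic() + self.ttl, value)

cache = TTLCache(ttl_seconds=0.05)  # short TTL so the demo runs quickly
cache.set("FOR c IN Company RETURN c", [{"ticker": "AAPL"}])
print(cache.get("FOR c IN Company RETURN c"))  # cache hit
time.sleep(0.06)
print(cache.get("FOR c IN Company RETURN c"))  # None, the entry expired
```

Keying the cache on the generated AQL plus bind variables means identical questions within the window skip the database entirely.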
Future Enhancements
Real-time Data
WebSocket connections for live market data updates, streaming prediction market probability changes as they happen.
Portfolio Tracking
User accounts to track favorite companies, save queries, and set up alerts for specific market conditions.
Advanced ML Models
Time-series forecasting with LSTM, anomaly detection for unusual trading patterns, correlation discovery between data sources.
More Data Sources
Twitter sentiment, Reddit discussions, earnings call transcripts, patent filings, and international market data.
Custom Dashboards
Drag-and-drop dashboard builder with custom charts, metrics, and KPIs tailored to individual research needs.
API Access
Public API with authentication for programmatic access to KARGA capabilities, enabling integrations with trading platforms.