Not Financial Advice: This platform is for informational purposes only. Market data may be delayed. Always conduct your own research.

About KARGA Markets

A technical deep-dive into our AI-powered natural language interface to a multi-source financial graph

The Problem

Traditional financial data platforms store information in isolated silos. Stock prices live in one database, government contracts in another, SEC filings in a third, and prediction markets somewhere else entirely.

When you want to answer a question like "Which S&P 500 companies with significant government contracts are mentioned in prediction markets?", you'd need to manually query multiple systems, export data, and perform complex joins in spreadsheets.

There had to be a better way.

Why Knowledge Graphs?

1. Knowledge Graphs

Data stored as interconnected nodes and relationships using ArangoDB, enabling complex multi-hop queries across disparate data sources in milliseconds.

2. Natural Language to Graph Query

GPT-4 turns your question into precise AQL and runs it against ArangoDB. You get structured data from the graph—not retrieved passages. The model then synthesizes insights and follow-ups from those query results (and optional web search).

3. Semantic Search

Vector embeddings enable concept-based search - find "cybersecurity contracts" even when documents use terms like "network security" or "threat detection."
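At its core, concept-based matching reduces to comparing embedding vectors by cosine similarity. A minimal sketch (the toy 3-dim vectors and the `find_similar` helper are illustrative only; real embeddings come from an embedding model):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_similar(query_vec, documents, threshold=0.75):
    """Return documents whose embedding is close to the query embedding."""
    return [
        doc for doc in documents
        if cosine_similarity(query_vec, doc["embedding"]) >= threshold
    ]

# Toy 3-dim embeddings; production vectors are model-generated and much larger.
docs = [
    {"text": "network security contract", "embedding": [0.9, 0.1, 0.0]},
    {"text": "wheat futures report",      "embedding": [0.0, 0.2, 0.9]},
]
query = [0.85, 0.15, 0.05]  # stand-in for the embedding of "cybersecurity contracts"
matches = find_similar(query, docs)
```

The "network security" document matches even though it never contains the word "cybersecurity" — that is the whole point of semantic search.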

KARGA Markets is an AI-powered financial intelligence platform that transforms how analysts interact with multi-source data. Built on a foundation of graph database technology and large language models, KARGA enables natural language queries across traditionally siloed datasets including stock prices, government contracts, SEC filings, prediction markets, commodity futures, and energy data. Questions are translated into graph queries (AQL), executed, then summarized—no document retrieval step.

Mission

Democratize access to complex financial analysis by making cross-domain data exploration as simple as asking a question. No SQL required, no manual data joins, no isolated silos - just intelligence.

Technology

ArangoDB multi-model graph database, OpenAI GPT-4 for query generation and analysis, Next.js 14 frontend, FastAPI backend, Doc2Vec embeddings for semantic search, and automated data pipelines for daily updates.

Recent Enhancements

SEC Sentence Embeddings: 4.36M sentences with Doc2Vec financial domain embeddings
Visual Query Builder: Smart field type detection with optimized operator selection
Enriched Results: AI inference engine with contextual insights
Options Flow: Unusual activity detection with 20-day baselines

We're in beta and improving with your feedback.

Send feedback: karga.analytics@gmail.com

Graph Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                          USER INTERFACE                              │
│  Next.js 14 + React + Framer Motion + Tailwind CSS                  │
│  • Natural Language Query Input                                      │
│  • Interactive Graph Visualization (ReactFlow)                       │
│  • Real-time Market Cards                                            │
│  • Data Tables with Filtering/Sorting                                │
└─────────────────────┬───────────────────────────────────────────────┘
                      │ HTTPS / REST API
┌─────────────────────▼───────────────────────────────────────────────┐
│                     FASTAPI BACKEND                                  │
│  Python 3.13 + FastAPI + Pydantic                                    │
│                                                                       │
│  ┌──────────────────────────────────────────────────────┐           │
│  │  Query Pipeline (Parallel Execution)                  │           │
│  │  ┌────────────┐  ┌──────────────┐  ┌──────────────┐ │           │
│  │  │ GPT-4      │  │ Perplexity   │  │ ArangoDB     │ │           │
│  │  │ Intent     │  │ Web Search   │  │ Graph Query  │ │           │
│  │  │ Detection  │  │ (Current     │  │ (Historical  │ │           │
│  │  │            │  │  Events)     │  │  Data)       │ │           │
│  │  └────┬───────┘  └──────┬───────┘  └──────┬───────┘ │           │
│  │       │                 │                   │         │           │
│  │       └─────────────────┴───────────────────┘         │           │
│  │                         │                              │           │
│  │                  ┌──────▼──────────┐                  │           │
│  │                  │  GPT-4 Synthesis │                  │           │
│  │                  │  Combines Results│                  │           │
│  │                  └──────────────────┘                  │           │
│  └──────────────────────────────────────────────────────┘           │
│                                                                       │
│  Security: Rate Limiting • Input Validation • HSTS Headers           │
└─────────────────────┬───────────────────────────────────────────────┘
                      │ AQL Queries
┌─────────────────────▼───────────────────────────────────────────────┐
│                    ARANGODB CLOUD                                    │
│  Multi-Model Database (Document + Graph + Search)                   │
│                                                                       │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐              │
│  │  Document    │  │  Graph       │  │  Vector      │              │
│  │  Collections │  │  Edges       │  │  Embeddings  │              │
│  └──────────────┘  └──────────────┘  └──────────────┘              │
│                                                                       │
│  • 612 companies (S&P 500)                                           │
│  • 2M+ daily market data points (OHLCV + 40 indicators)             │
│  • 100K+ government contract awards (with embeddings)                │
│  • 7.5K SEC filings + 4.36M sentences (Doc2Vec embeddings)          │
│  • 20K+ prediction markets (Polymarket + Kalshi)                    │
│  • 64K+ futures prices (CME commodities)                             │
│  • CFTC commodity positions + EIA energy data                        │
│  • Daily options flow (612 tickers)                                  │
└───────────────────────────────────────────────────────────────────────┘

Knowledge Graph Structure

Data isn't just stored—it's connected. Here's how relationships enable powerful queries:

Company → Market Data

HAS_MARKETDATA edges connect companies to their daily stock prices, enabling queries like "Show me tech companies with SMA_50 > SMA_200 (golden cross)"
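The golden-cross screen above boils down to comparing two moving averages per company. A minimal sketch of that check (window sizes mirror the SMA_50/SMA_200 fields; the function names are hypothetical):

```python
def sma(closes: list[float], window: int) -> float:
    """Simple moving average over the last `window` closing prices."""
    return sum(closes[-window:]) / window

def is_golden_cross(closes: list[float], short: int = 50, long: int = 200) -> bool:
    """True when the short SMA sits above the long SMA (bullish signal)."""
    if len(closes) < long:
        return False  # not enough history to compute the long average
    return sma(closes, short) > sma(closes, long)

# An uptrending series: recent prices are higher, so SMA_50 > SMA_200.
uptrend = [100 + 0.5 * i for i in range(250)]
```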

Company → Government Awards

HAS_AWARD edges link companies to contracts, enabling semantic searches: "Defense companies with cybersecurity contracts over $10M"

Market → Company

market_mentions_company edges connect prediction markets to mentioned tickers: "Tesla reaches $300" → TSLA

Company → SEC Filings → Sentences

HAS_FILING → has_section → has_sentence. Multi-hop traversal for sentiment analysis: "Show negative FinBERT sentences from recent Apple 10-Ks"

Company → Commodity Positions

HAS_COMMODITY_POSITION links companies to CFTC data for commodity exposure analysis

Company → Options Flow

COMPANY_HAS_OPTIONS connects companies to daily options activity. OPTIONS_BEFORE_AWARD flags unusual activity before contract awards

CFTC → Futures → EIA Data

POSITION_ON_COMMODITY, INVENTORY_AFFECTS_PRICE, and STORAGE_AFFECTS_PRICE enable supply/demand correlation analysis

SEC Sentences (Semantic Search)

Multi-hop traversal HAS_FILING → has_section → has_sentence with Doc2Vec embeddings: "Find sentences discussing supply chain risks in energy sector 10-Ks"

Company → Commodities (Direct)

COMPANY_TRADES_COMMODITY links companies to futures_prices for commodity exposure (e.g. XOM → crude oil). Enables "energy stocks vs crude prices" without CFTC hop.

SEC Exhibits & XBRL

has_exhibit (sec_filings → sec_exhibits) and has_xbrl_data (sec_filings → sec_xbrl_data) for structured financials and exhibit-level search.

Document Collections (21)

Vertex collections in the graph. Run introspect_schema.py for live counts.

Company, MarketData, Award, EconomicData, commodity_positions, futures_prices, options_flow, eia_crude_inventory, eia_natgas_storage, eia_natgas_production, eia_lng_exports, sec_filings, sec_sections, sec_sentences, sec_exhibits, sec_xbrl_data, prediction_markets_polymarket, prediction_markets_kalshi, polymarket_traders, polymarket_positions, polymarket_price_history

Edge Collections (22)

Relationship edges for graph traversal.

HAS_MARKETDATA, HAS_AWARD, HAS_COMMODITY_POSITION, HAS_FILING, COMPANY_TRADES_COMMODITY, COMPANY_HAS_OPTIONS, HAS_OPTIONS_ACTIVITY, OPTIONS_BEFORE_AWARD, OPTIONS_BEFORE_FILING, POSITION_ON_COMMODITY, INVENTORY_AFFECTS_PRICE, STORAGE_AFFECTS_PRICE, MACRO_IMPACTS_COMMODITY, has_section, has_sentence, has_exhibit, has_xbrl_data, market_mentions_company_polymarket, market_related_to_sector_polymarket, market_affects_company_polymarket, market_mentions_company_kalshi, market_related_to_sector_kalshi, trader_has_position, position_in_market

Example Multi-Hop Query

"Find energy companies with government contracts mentioning 'renewable' that are mentioned in prediction markets with volume > $50k"

→ Traverses Company → Awards (semantic search) → Markets (graph join) in milliseconds

Data Universe

Stock Market Data

  • Source: Yahoo Finance API
  • Coverage: All S&P 500 companies
  • Data: OHLCV, volume, market cap, P/E ratios, technical indicators (SMA, EMA, MACD, RSI, Bollinger Bands)
  • Update Frequency: Daily via Airflow DAG

Government Contracts

  • Source: USASpending.gov API
  • Coverage: Federal contract awards to public companies
  • Data: Award amounts, agencies, descriptions, dates
  • Special Feature: Vector embeddings for semantic search (find "AI contracts" without exact keyword match)

Prediction Markets

  • Sources: Polymarket API, Kalshi API
  • Coverage: Politics, economics, sports, entertainment
  • Data: Probabilities, volumes, liquidity, traders
  • Connection: Graph edges link markets to mentioned companies (e.g., "Will Tesla reach $300?" → Tesla stock)

SEC Filings

  • Source: SEC EDGAR API
  • Types: 10-K (annual), 10-Q (quarterly), 8-K (events)
  • Processing: 7.5K filings parsed into 4.36M sentences
  • Sentiment: FinBERT scores for each sentence (-1 to +1)
  • Embeddings: Doc2Vec 300-dim financial domain embeddings for semantic search
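With a FinBERT score stored on every sentence, a negative-sentiment screen becomes a simple filter over sentence documents. A sketch under assumed field names (`text`, `sentiment`) and an illustrative cutoff:

```python
def negative_sentences(sentences: list[dict], cutoff: float = -0.5) -> list[dict]:
    """Keep sentences whose FinBERT sentiment score falls below the cutoff."""
    return [s for s in sentences if s["sentiment"] < cutoff]

# Toy sentence documents; real ones come from parsed SEC filings.
sents = [
    {"text": "Revenue grew 12% year over year.",     "sentiment": 0.8},
    {"text": "We face material supply chain risks.", "sentiment": -0.7},
]
flagged = negative_sentences(sents)
```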

Economic Indicators

  • Source: Federal Reserve Economic Data (FRED)
  • Data: S&P 500 index, Fed funds rate, unemployment, GDP, yield curves
  • Coverage: Historical time series data

Commodity Positions

  • Source: CFTC Commitments of Traders Report
  • Data: Long/short positions by trader type (commercial, non-commercial, retail)
  • Commodities: Oil, gold, wheat, corn, natural gas, etc.

Options Flow

  • Source: Options data via yfinance
  • Coverage: All S&P 500 companies (612 tickers)
  • Data: Call/put volumes, open interest, implied volatility, put/call ratios
  • Detection: Unusual activity flags for potential insider trading signals
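The 20-day baseline test can be sketched as a z-score check: flag a day whose volume sits far above the trailing mean. The 2-standard-deviation threshold below is an illustrative assumption, not necessarily the platform's rule:

```python
from statistics import mean, stdev

def is_unusual(today_volume: float, trailing: list[float], z: float = 2.0) -> bool:
    """Flag volume that exceeds the trailing 20-day mean by more than z std devs."""
    if len(trailing) < 20:
        return False  # not enough history to form a baseline
    baseline = trailing[-20:]
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return today_volume > mu
    return (today_volume - mu) / sigma > z

# 20 trailing days of call volume for one ticker (toy numbers).
history = [1000.0] * 19 + [1100.0]
```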

Futures Prices

  • Source: CME Group via yfinance
  • Records: 64,000+ historical prices
  • Commodities: Crude oil, natural gas, gold, silver, copper, corn, wheat, soybeans
  • Indicators: OHLCV data plus technical indicators (RSI, MACD, SMA)

EIA Energy Data

  • Source: U.S. Energy Information Administration API
  • Data: Crude oil inventory, natural gas storage, production, LNG exports
  • Frequency: Weekly and monthly updates
  • Connection: Linked to futures prices for supply/demand analysis

AI Query Pipeline

When you ask a question, query generation follows a two-step flow: natural language → structured JSON plan → deterministic AQL. Only one LLM call is involved in generating the query.

Step 1: Intent Detection

GPT-4 classifies your query: Is it about a specific ticker (AAPL, MSFT) or a concept (AI, cybersecurity)? This determines whether to use exact matching or semantic search.

Input: "Show me AI companies with government contracts"
Intent: concept_query
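In production this classification is done by GPT-4. A toy heuristic illustrating the distinction (the ticker set and the all-caps rule are illustrative only, not the platform's classifier):

```python
import re

KNOWN_TICKERS = {"AAPL", "MSFT", "TSLA", "XOM"}  # illustrative subset of the universe

def detect_intent(query: str) -> str:
    """Crude stand-in for the GPT-4 classifier: ticker query vs. concept query."""
    tokens = re.findall(r"\b[A-Z]{1,5}\b", query)  # short all-caps tokens
    if any(tok in KNOWN_TICKERS for tok in tokens):
        return "ticker_query"
    return "concept_query"
```

A ticker query routes to exact matching; a concept query routes to embedding-based semantic search.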

Step 2a: JSON Query Plan (LLM)

GPT-4 receives the full database schema (collections, fields, edges) and outputs a structured JSON plan: primary collection, traversals (from/to/edge), filters, sort, limit, and bind_vars. No raw AQL yet—just a machine-readable plan.

{
  "intent": "companies with semantic match on awards",
  "primary_collection": "Award",
  "traversals": [{"from_collection": "Award", "to_collection": "Company", "edge_collection": "HAS_AWARD"}],
  "filters": {"Award.description_embedding": "cosine_similarity @query_vector >= 0.75"},
  "bind_vars": {"query_vector": "[...]"}
}

Step 2b: JSON → AQL (Deterministic)

A deterministic converter (json_to_aql) turns the JSON plan into executable AQL: FOR loops, edge traversals, FILTERs, SORT, LIMIT, and RETURN. No second LLM—consistent, auditable query generation.

FOR award IN Award
  FILTER COSINE_SIMILARITY(award.description_embedding, @query_vector) >= 0.75
  FOR edge_company IN HAS_AWARD
    FILTER edge_company._from == award._id
    FOR company IN Company
      FILTER company._id == edge_company._to
      RETURN {company, award}
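A heavily simplified sketch of what such a converter might look like. The plan shape loosely mirrors the example above (filters here are raw strings rather than a dict); the real `json_to_aql` certainly handles many more cases:

```python
def json_to_aql(plan: dict) -> str:
    """Render a minimal query plan (primary collection, raw filters,
    one-hop edge traversals) into an AQL string. Illustrative only."""
    lines = [f"FOR doc IN {plan['primary_collection']}"]
    for f in plan.get("filters", []):
        lines.append(f"  FILTER {f}")
    for i, t in enumerate(plan.get("traversals", [])):
        edge, target = t["edge_collection"], t["to_collection"]
        lines.append(f"  FOR e{i} IN {edge}")
        lines.append(f"    FILTER e{i}._from == doc._id")
        lines.append(f"    FOR v{i} IN {target}")
        lines.append(f"      FILTER v{i}._id == e{i}._to")
    lines.append("  RETURN doc")
    return "\n".join(lines)

plan = {
    "primary_collection": "Award",
    "filters": ["COSINE_SIMILARITY(doc.description_embedding, @query_vector) >= 0.75"],
    "traversals": [{"edge_collection": "HAS_AWARD", "to_collection": "Company"}],
}
aql = json_to_aql(plan)
```

Because the conversion is pure string assembly from a validated plan, the same plan always yields the same AQL, which is what makes the generation step auditable.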

Step 3: Parallel Execution

Two paths run in parallel when needed:

  • Database Query: AQL executes against ArangoDB (historical data)
  • Web Search: Perplexity searches for current events (real-time context)
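The fan-out can be sketched with `asyncio.gather`; the stub coroutines below stand in for the real ArangoDB and Perplexity calls:

```python
import asyncio

async def run_graph_query(aql: str) -> dict:
    """Stand-in for executing AQL against ArangoDB."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {"source": "arangodb", "rows": [{"ticker": "XOM"}]}

async def run_web_search(question: str) -> dict:
    """Stand-in for the Perplexity current-events search."""
    await asyncio.sleep(0.01)
    return {"source": "web", "snippets": ["latest headlines"]}

async def answer(question: str, aql: str) -> list[dict]:
    # Both calls start immediately and complete concurrently,
    # so total latency is roughly max(db, web) rather than their sum.
    return list(await asyncio.gather(run_graph_query(aql), run_web_search(question)))

results = asyncio.run(answer("energy stocks vs crude", "FOR c IN Company RETURN c"))
```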

Step 4: Synthesis & Analysis

GPT-4 combines database results with optional web context, analyzes patterns, and generates:

  • Structured Analysis: Formatted results with key metrics
  • Insights: Trends, correlations, anomalies
  • Follow-up Questions: Suggested deeper dives

Technology Stack

Frontend

  • Framework: Next.js 14 (App Router)
  • UI: React 18, TypeScript
  • Styling: Tailwind CSS
  • Animations: Framer Motion
  • Graph Viz: ReactFlow
  • Hosting: Vercel

Backend

  • Framework: FastAPI (Python 3.13)
  • Validation: Pydantic
  • Security: SlowAPI rate limiting
  • LLM: OpenAI GPT-4
  • Web Search: Perplexity AI
  • Hosting: Railway

Database

  • Platform: ArangoDB Cloud
  • Type: Multi-model (Document + Graph)
  • Query Language: AQL
  • Embeddings: OpenAI text-embedding-3-small
  • Size: ~5GB (2M+ documents)
  • Location: Germany (GDPR compliant)

Data Pipeline

  • Orchestration: Apache Airflow
  • Processing: Python, Pandas, NumPy
  • Sentiment: FinBERT
  • Schedule: Daily updates at 2 AM UTC
  • Monitoring: Airflow UI + logs

Performance & Scale

  • < 3s: average query time (DB + AI analysis)
  • 2M+: documents in the graph (companies, markets, filings)
  • 50ms: graph traversal time (3-hop relationships)

Performance Optimizations

  • Persistent indexes on ticker, date, volume fields
  • Skip-list indexes for range queries
  • Edge collections for O(1) relationship lookups
  • Query result caching (5-minute TTL)
  • Parallel DB + web search execution
  • Streaming results with batch_size=1000
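The 5-minute result cache can be sketched as a dict keyed by query text with a per-entry timestamp. This is a minimal in-process version for illustration; a multi-instance deployment would need a shared store such as Redis:

```python
import time

class TTLCache:
    """Tiny query-result cache with per-entry expiry."""

    def __init__(self, ttl_seconds: float = 300.0):  # 5-minute TTL
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # entry expired; evict it
            return None
        return value

    def set(self, key: str, value) -> None:
        self._store[key] = (time.monotonic(), value)

cache = TTLCache(ttl_seconds=300.0)
cache.set("FOR c IN Company LIMIT 5 RETURN c", [{"ticker": "AAPL"}])
hit = cache.get("FOR c IN Company LIMIT 5 RETURN c")
```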

Future Enhancements

Real-time Data

WebSocket connections for live market data updates, streaming prediction market probability changes as they happen.

Portfolio Tracking

User accounts to track favorite companies, save queries, and set up alerts for specific market conditions.

Advanced ML Models

Time-series forecasting with LSTM, anomaly detection for unusual trading patterns, correlation discovery between data sources.

More Data Sources

Twitter sentiment, Reddit discussions, earnings call transcripts, patent filings, and international market data.

Custom Dashboards

Drag-and-drop dashboard builder with custom charts, metrics, and KPIs tailored to individual research needs.

API Access

Public API with authentication for programmatic access to KARGA capabilities, enabling integrations with trading platforms.