Run models locally on your hardware with Ollama, or use Ollama Cloud for hosted inference.
Performance Notice: Models below 30 billion parameters have shown significantly lower performance on agentic coding tasks. While smaller models (7B, 13B) can be useful for experimentation and learning, they are generally not recommended for production coding work or complex software engineering tasks.

Local Ollama

Run models entirely on your machine with no internet required.

Configuration

Configuration examples for ~/.factory/config.json. The api_key field can be any non-empty placeholder, since the local API does not require authentication:
{
  "custom_models": [
    {
      "model_display_name": "Qwen 2.5 Coder 32B [Local]",
      "model": "qwen2.5-coder:32b",
      "base_url": "http://localhost:11434/v1",
      "api_key": "not-needed",  # add any non-empty value
      "provider": "generic-chat-completion-api",
      "max_tokens": 16000
    },
    {
      "model_display_name": "Qwen 2.5 Coder 7B [Local]",
      "model": "qwen2.5-coder:7b",
      "base_url": "http://localhost:11434/v1",
      "api_key": "not-needed",  # add any non-empty value
      "provider": "generic-chat-completion-api",
      "max_tokens": 4000
    }
  ]
}
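
Once the server is running (see Setup below), you can sanity-check the OpenAI-compatible endpoint these entries point at. A minimal sketch, assuming the qwen2.5-coder:32b model has already been pulled and the server is listening on the default port:

# Ask the local Ollama server for a short completion via its OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:32b",
    "messages": [{"role": "user", "content": "Reply with OK"}]
  }'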

Setup

Context Window Configuration: For optimal performance with Factory, ensure you set the context window to at least 32,000 tokens. You can either:
  • Use the context window slider in the Ollama app (set to 32k minimum)
  • Set the environment variable before starting the server: OLLAMA_CONTEXT_LENGTH=32000 ollama serve
Without adequate context, the experience will be significantly degraded.
  1. Install Ollama from ollama.com/download
  2. Pull desired models:
    # Recommended models
    ollama pull qwen2.5-coder:32b
    ollama pull qwen2.5-coder:7b
    
  3. Start the Ollama server with extra context:
    OLLAMA_CONTEXT_LENGTH=32000 ollama serve
    
  4. Add the configurations above to your Factory config (a quick verification sketch follows below)
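
Before pointing Factory at the server, a quick sanity check that steps 1-3 took effect; this sketch uses only standard Ollama CLI commands:

# Confirm the models pulled in step 2 are available locally
ollama list

# Confirm the server from step 3 is responding and show which models are currently loaded
ollama ps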

Approximate Hardware Requirements

Model Size    RAM Required    VRAM (GPU)
3B params     4GB             3GB
7B params     8GB             6GB
13B params    16GB            10GB
30B params    32GB            20GB
70B params    64GB            40GB
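
To compare these figures with your own machine, the following checks may help; this is a sketch that assumes Linux or macOS and, for the GPU line, an NVIDIA card with nvidia-smi installed:

# Total system RAM
free -h                # Linux
sysctl hw.memsize      # macOS (reports bytes)

# Total GPU memory (NVIDIA GPUs only)
nvidia-smi --query-gpu=memory.total --format=csv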

Ollama Cloud

Use Ollama’s cloud service for hosted model inference without local hardware requirements. The best performance for agentic coding has been observed with qwen3-coder:480b. For a full list of available cloud models, visit ollama.com/search?c=cloud.

Configuration

{
  "custom_models": [
    {
      "model_display_name": "qwen3-coder [Online]",
      "model": "qwen3-coder:480b-cloud",
      "base_url": "http://localhost:11434/v1/",
      "api_key": "not-needed",  # add any non-empty value
      "provider": "generic-chat-completion-api",
      "max_tokens": 128000
    }
  ]
}

Getting Started with Cloud Models

  1. Ensure Ollama is installed and running locally
  2. Cloud models are accessed through your local Ollama instance - no API key needed
  3. Add the configuration above to your Factory config
  4. The model will automatically use cloud compute when requested (see the example below)
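
For example, you can exercise the cloud model from the Ollama CLI before wiring it into Factory. This is a sketch assuming the qwen3-coder:480b-cloud model from the configuration above; depending on your Ollama version, cloud models may also require signing in to an ollama.com account first:

# Register the cloud model with your local instance
ollama pull qwen3-coder:480b-cloud

# Send a one-off prompt; inference runs on Ollama's cloud, not your hardware
ollama run qwen3-coder:480b-cloud "Reply with OK"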

Troubleshooting

Local server not connecting

  • Ensure Ollama is running: ollama serve
  • Check if port 11434 is available
  • Try curl http://localhost:11434/api/tags to test
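
A slightly fuller version of that check; a healthy server returns HTTP 200 and a JSON list of installed models:

# Print only the HTTP status code; 200 means the server is reachable
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:11434/api/tags

# Show the full response (the locally installed models)
curl -s http://localhost:11434/api/tags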

Model not found

  • Pull the model first: ollama pull model-name
  • Check exact model name with ollama list

Notes

  • Local API doesn’t require authentication (use any placeholder for api_key)
  • Models are stored in ~/.ollama/models/
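
For example, to check how much disk space pulled models are taking up (a sketch assuming the default storage location):

# Approximate disk usage of all pulled models
du -sh ~/.ollama/models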