Privacy-first AI. No data stored. No subscriptions.

Model Benchmarks

Real-World Testing For Off-Grid Intelligence

Introduction

A lot of thought, research, and development went into creating the OffGrid AI ToolKit. We pushed the technology to its limits, knowing this product might be used in survival or even life-or-death situations (please use responsibly and read our disclaimers).

Creating a portable AI solution that runs entirely from a flash drive presented unique challenges (see How It Works). One critical decision was choosing which AI models would work best for REAL WORLD situations.

While there's a wide selection of open-source AI models with published benchmarks, those benchmarks don't cover the questions that matter for survival and field use. So we created our own rigorous testing methodology.

Model selection was only the first step; plenty of testing went beyond choosing the right models. For example, here's more about our Ready-Made Prompt testing and what went into it.

Why We Built Our Own Benchmarks

We didn't just test models on survival knowledge. We tested their ability to think. Every person, situation, and circumstance is unique. That's where survival books, PDFs, and videos fall short.

Our testing broke into two critical categories:

1. Survival & Emergency Knowledge

Examples:

  • What are the best ways to purify water from a desert stream using minimal gear?
  • Explain how to prioritize survival tasks if stranded in the wilderness with no supplies.
  • What are the signs of dehydration and how should it be treated in the field?

2. Intelligence & Problem Solving

Examples:

  • You have 3 boxes: one contains only apples, one only oranges, and one a mix. They are all mislabeled. You can reach into one box and take out one fruit. Which box should you pick to correctly re-label all three?
  • You're in a room with two doors. One leads to certain death, the other to freedom. One guard always lies, the other always tells the truth. You may ask one question to one guard. What do you ask?
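The three-boxes puzzle above has a provably correct strategy: draw from the box labeled "mixed". The sketch below (our illustrative code, not part of the toolkit) enumerates every possible mislabeling and checks that this one draw always determines all three boxes:

```python
from itertools import permutations

LABELS = ("apples", "oranges", "mixed")

def deduce_contents(drawn_fruit):
    """Deduce every box's true contents after drawing one fruit
    from the box labeled 'mixed'."""
    # All labels are wrong, so the "mixed"-labeled box holds only the drawn fruit.
    other_fruit = "oranges" if drawn_fruit == "apples" else "apples"
    return {
        "mixed": drawn_fruit,      # holds only what we drew
        other_fruit: "mixed",      # can't match its label or the drawn fruit
        drawn_fruit: other_fruit,  # the only content left
    }

def check_strategy():
    """Verify the deduction against every way the boxes can be mislabeled."""
    ok = True
    for contents in permutations(LABELS):
        truth = dict(zip(LABELS, contents))
        # "All mislabeled" means no label matches its box's contents.
        if any(label == truth[label] for label in LABELS):
            continue
        drawn = truth["mixed"]  # the fruit we'd pull from the "mixed"-labeled box
        ok &= deduce_contents(drawn) == truth
    return ok

print(check_strategy())  # → True
```

Only two mislabelings are possible, and the single draw distinguishes them, which is exactly the kind of constraint reasoning these test prompts probe for.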

Key Finding: Our methodical tests revealed performance patterns that didn't match published benchmarks. The winner was clear: Gemma models consistently outperformed all others for real-world applications.

Why These Models Made the Cut

The Gemma3 Family: 27B, 12B, and 4B

After we tested 15+ model families, Gemma3 dominated both survival knowledge AND problem-solving intelligence. These models aren't just regurgitating facts; they're thinking through scenarios.

  • Gemma3-27b: Maximum intelligence for complex scenarios. The "strategist" model.
  • Gemma3-12b: Perfect balance of capability and speed. The "planner" model.
  • Gemma3-4b: Quick responses for basic queries. The "field guide" model.

MedGemma: Specialized Medical Intelligence

Fine-tuned on medical literature, MedGemma provides field-appropriate medical guidance while emphasizing when professional care is needed. Remember: This is educational only. Always seek proper medical attention when available.

Survival AI Testing

Results from 300+ Survival-Focused Prompts

| Rank | Model | Accuracy /10 | Reasoning /10 | Clarity /10 | Offline Fit /10 | Avg Score | Notes |
| 🥇 1 | Gemma3:27b | 9.95 | 9.9 | 9.85 | 9.92 | 9.91 | Most comprehensive, adaptable responses. |
| 🥈 2 | Gemma3:12b | 9.9 | 9.8 | 9.82 | 9.8 | 9.83 | Nearly as accurate, faster, more concise. |
| 🥉 3 | Gemma3:4b | 9.6 | 9.3 | 9.5 | 9.3 | 9.43 | Clear, to the point, beginner-friendly. |
| 4 | Deepseek-r1:14b | 9.1 | 9.0 | 9.8 | 8.74 | 9.16 | Good general knowledge, less adaptive. |
| 5 | Deepseek-r1:32b | 8.9 | 9.1 | 9.3 | 8.5 | 8.95 | Uneven performance, some errors. |
| 6 | Deepseek-r1:7b | 8.5 | 7.7 | 8.15 | 7.0 | 7.81 | Missed critical details. |
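The Avg Score column appears to be the unweighted mean of the four criteria (that equal weighting is our reading of the table, not a published formula, and a few rows differ by a hundredth or two from rounding):

```python
def avg_score(scores):
    """Unweighted mean of the four per-criterion scores
    (accuracy, reasoning, clarity, offline fit), each out of 10.
    Equal weighting is assumed, not stated in the published tables."""
    return sum(scores) / len(scores)

# Gemma3:27b row: (9.95 + 9.9 + 9.85 + 9.92) / 4 ≈ 9.91 as published
print(avg_score([9.95, 9.9, 9.85, 9.92]))
# Gemma3:12b row: (9.9 + 9.8 + 9.82 + 9.8) / 4 ≈ 9.83 as published
print(avg_score([9.9, 9.8, 9.82, 9.8]))
```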

Intelligence Testing

Problem-Solving and Reasoning Performance

| Rank | Model | Accuracy /10 | Reasoning /10 | Clarity /10 | Offline Fit /10 | Avg Score | Notes |
| 🥇 1 | Gemma3:27b | 9 | 9 | 9 | 8 | 8.8 | Methodical, structured, rarely fooled. |
| 🥈 2 | Gemma3:12b | 9 | 8 | 9 | 8 | 8.5 | Almost as strong, slightly denser wording. |
| 🥉 3 | Gemma3:4b | 9 | 8 | 9 | 7 | 8.3 | Clear and concise, best for quick answers. |

Key Finding: Survival is about more than memorized facts. It's about thinking under pressure. Gemma3 models consistently demonstrated superior problem-solving and logical reasoning.

Hardware Requirements

Note: First run response times are slower as models load into memory. Subsequent queries run significantly faster once loaded.

| Model | First Run Response | After Loaded | RAM Required | Best For |
| Gemma3-4b | 30-90 seconds | 15-60 seconds | 8GB+ | Quick queries, basic tasks |
| Gemma3-12b | 2-3 minutes | 1-2 minutes | 16GB+ | Complex analysis, wider knowledge |
| Gemma3-27b | ~10 minutes | 4-5 minutes | 32GB+ | Maximum intelligence, deep thinking |
| MedGemma-4b | 30-90 seconds | 15-60 seconds | 8GB+ | Medical information, field health |
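The RAM tiers above line up with a common rule of thumb for 4-bit quantized models: roughly half a byte per parameter for the weights, plus a few gigabytes of runtime and context overhead. The constants in this sketch are our illustrative assumptions, not the toolkit's published sizing method:

```python
def estimated_ram_gb(params_billions, bytes_per_param=0.5, overhead_gb=2.0):
    """Very rough RAM estimate for a 4-bit quantized model:
    weights (~0.5 bytes per parameter) plus fixed runtime/context overhead.
    Both constants are illustrative assumptions, not measured values."""
    return params_billions * bytes_per_param + overhead_gb

for name, size_b in [("Gemma3-4b", 4), ("Gemma3-12b", 12), ("Gemma3-27b", 27)]:
    print(f"{name}: ~{estimated_ram_gb(size_b):.1f} GB")
```

Each estimate lands comfortably inside the corresponding tier in the table (8GB+, 16GB+, 32GB+), which is why the next size of RAM up is recommended rather than an exact figure.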

Disclaimer: Your times may differ but should be close to these averages, which come from hundreds of tests across dozens of computers.

Why Gemma Won (And Why It's All We Include)

Pros:

  • Consistent high accuracy in reasoning and survival scenarios
  • Step-by-step explanations with adaptable strategies
  • Handles tricky or adversarial prompts without breaking
  • Works fully offline across all model sizes
  • Vision capabilities for analyzing images in the field
  • Knowledge current through August 2024
  • Optimized for USB drive deployment

The Reality:

  • Sometimes verbose (but thorough is better than wrong)
  • Requires patience for larger models
  • Not as fast as cloud AI (but works anywhere)
  • The 4b model is the quickest but also the smallest; verify answers to complex questions with one of the larger models.

For realistic expectations about performance, see What to Expect →

Transparent Approach

We made our testing framework available to show exactly how we evaluated these models.
No black box. No marketing hype.

Here are our actual Google Docs with testing framework and unaltered results from all individual tests. This doesn't include our real-world field testing, which was done in actual and scripted survival situations.

View Our Unedited Model Benchmark Research Docs →