ForensicSQL-Llama-3.2-3B

Model Description

ForSQLiteLM (ForensicSQL-Llama-3.2-3B) is a fine-tuned Llama 3.2-3B model specialized for generating SQLite queries from natural language requests against mobile forensic databases. The model converts investigative questions into executable SQL queries across a wide range of forensic artifact databases — WhatsApp, Signal, iMessage, Android SMS, iOS Health, WeChat, Instagram, blockchain wallets, and many more.

This model was developed as part of a research project and accompanying journal paper investigating LLM fine-tuning for forensic database analysis, and is integrated into FQLite, an established open-source forensic analysis tool.

Key result: 91.0% execution accuracy on a 100-example held-out test set — within 4 percentage points of GPT-4o (95.0%) evaluated under identical conditions (McNemar test: p ≈ 0.39, not significant at α = 0.05), while running fully locally with no internet connectivity required.

Model Details

Property Value
Base Model meta-llama/Llama-3.2-3B-Instruct
Fine-tuning Method Full fine-tune (bf16)
Training Dataset SQLiteDS — 800 training examples, 191 forensic artifact categories
Training Framework Hugging Face Transformers
Best Val Loss 0.3043 (7 epochs)
Model Size (bf16) ~6 GB
Hardware Required 16 GB unified memory (Apple M-series) or equivalent GPU

Performance

Overall Results (fixed dataset, n=100, best configuration)

Metric Value
Overall Accuracy 91.0% (91/100)
95% CI (Wilson) [83.8%, 95.2%]
Executable Queries 92/100
GPT-4o Accuracy 95.0% (gap: 4 pp, p ≈ 0.39)
Base Model (no fine-tuning) 35.0%
Improvement over base +56 pp

Accuracy by Query Difficulty

Difficulty Accuracy n 95% CI vs. GPT-4o
Easy (single-table) 95.1% 39/41 [83.9%, 98.7%] 0.0 pp
Medium (joins, aggregation) 87.5% 28/32 [71.9%, 95.0%] 0.0 pp
Hard (CTEs, window functions) 88.9% 24/27 [71.9%, 96.1%] −3.7 pp

ForSQLiteLM matches GPT-4o exactly on Easy and Medium queries. The remaining gap is concentrated on Hard queries (complex CTEs, window functions, multi-table joins).

Accuracy by Forensic Domain

Domain Accuracy n 95% CI
Messaging & Social 100.0% 28/28 [87.9%, 100.0%]
Android Artifacts 94.4% 17/18 [74.2%, 99.0%]
Productivity & Other 88.9% 16/18 [67.2%, 96.9%]
iOS CoreData 84.0% 21/25 [65.3%, 93.6%]
Finance & Crypto 81.8% 9/11 [52.3%, 94.9%]

Prompt Configuration Ablation

Configuration Overall Easy Medium Hard iOS
WITHOUT App Name 91.0% 95.1% 87.5% 88.9% 84.0%
WITH App Name 88.0% 92.7% 87.5% 81.5% 88.0%

★ Primary configuration — omitting the application name from the prompt yields 3 pp higher overall accuracy. Interestingly, including the app name helps iOS CoreData schemas (+4 pp) but hurts Hard queries (−7.4 pp); the primary configuration without app name is recommended for general use.

Post-Processing Pipeline Contribution

Component Queries saved
Execution feedback (retry) 7
Alias normalization 18
Column corrections (Levenshtein) 2

Training Progression

Configuration Val Loss Accuracy Δ
Base model (no fine-tuning) 35.0%
Fine-tuned, no augmentation 68.0% +33 pp
+ Data augmentation (3.4×) 74.0% +6 pp
+ Extended training (7 epochs) 0.3617 84.0% +10 pp
+ Post-processing pipeline 0.3617 87.0% +3 pp
+ Execution feedback 0.3617 90.0% +3 pp
+ Corrected training dataset (v5) 0.3043 91.0% +1 pp

Intended Use

Primary Use Cases

  • Mobile forensics investigations: automated SQL query drafting against seized device databases
  • Integration into forensic tools (FQLite, Autopsy, ALEAPP/iLEAPP workflows)
  • Research in domain-specific Text-to-SQL
  • Educational use for learning forensic database analysis

Important: This Model is a Drafting Assistant

ForSQLiteLM is not a replacement for SQL expertise. It generates candidate queries that require review by a practitioner with sufficient SQL knowledge before any reliance is placed on their results. The 91.0% accuracy means approximately 1 in 11 queries contains an error. In court-admissible or case-critical work, all outputs must be independently validated.

Out-of-Scope Use

  • Autonomous forensic decision-making without human review
  • Production systems requiring >95% guaranteed accuracy
  • General-purpose SQL generation outside the forensic domain
  • Non-SQLite databases (PostgreSQL, MySQL, etc.)

How to Use

Quick Start (Transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "pawlaszc/ForensicSQL-Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model.eval()

schema = """
CREATE TABLE message (
    ROWID INTEGER PRIMARY KEY,
    text TEXT,
    handle_id INTEGER,
    date INTEGER,
    is_from_me INTEGER,
    cache_has_attachments INTEGER
);
CREATE TABLE handle (
    ROWID INTEGER PRIMARY KEY,
    id TEXT,
    service TEXT
);
"""

request = "Find all messages received in the last 7 days that contain attachments"

# Note: do NOT use apply_chat_template — use plain-text prompt
prompt = f"""Generate a valid SQLite query for this forensic database request.

Database Schema:
{schema}

Request: {request}

SQLite Query:
"""

inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=300,
        do_sample=False,        # greedy decoding — do not change
    )

input_length = inputs['input_ids'].shape[1]
sql = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True)
print(sql.strip())

Important: Use plain-text tokenization (do not call apply_chat_template). The model was trained and evaluated with a plain-text prompt format. Use do_sample=False (greedy decoding) for reproducible results.

Python Helper Class

class ForensicSQLGenerator:
    def __init__(self, model_name="pawlaszc/ForensicSQL-Llama-3.2-3B"):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch

        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            device_map="auto"
        )
        self.model.eval()

    def generate_sql(self, schema: str, request: str) -> str:
        prompt = (
            "Generate a valid SQLite query for this forensic database request.\n\n"
            f"Database Schema:\n{schema}\n\n"
            f"Request: {request}\n\n"
            "SQLite Query:\n"
        )
        inputs = self.tokenizer(
            prompt, return_tensors="pt", truncation=True, max_length=2048
        )
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
        input_length = inputs["input_ids"].shape[1]

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs, max_new_tokens=300, do_sample=False
            )

        sql = self.tokenizer.decode(
            outputs[0][input_length:], skip_special_tokens=True
        )
        # Return first statement only, normalized
        return sql.strip().split("\n")[0].strip().rstrip(";") + ";"


# Usage
generator = ForensicSQLGenerator()
sql = generator.generate_sql(schema, "Find all unread messages from the last 24 hours")
print(sql)

With Ollama / llama.cpp (GGUF)

# With llama.cpp
./llama-cli -m forensic-sql-q4_k_m.gguf \
  --temp 0 \
  -p "Generate a valid SQLite query for this forensic database request.

Database Schema:
CREATE TABLE sms (_id INTEGER PRIMARY KEY, address TEXT, body TEXT, date INTEGER);

Request: Find all messages sent after midnight

SQLite Query:"

# With Ollama — create a Modelfile
cat > Modelfile << 'EOF'
FROM ./forensic-sql-q4_k_m.gguf
PARAMETER temperature 0
PARAMETER num_predict 300
EOF

ollama create forensic-sql -f Modelfile
ollama run forensic-sql

Training Details

Dataset — SQLiteDS

  • Total examples: 1,000 (800 train / 100 val / 100 test), fixed random seed 42
  • Forensic artifact categories: 191
  • Reference query validation: All 1,000 reference queries validated for execution correctness against in-memory SQLite; 50 queries (5%) corrected before final training
  • Augmentation: 3.4× expansion via instruction paraphrasing, WHERE clause reordering, and LIMIT injection — augmented examples confined to training split only
  • Dataset: pawlaszc/mobile-forensics-sql
  • License: CC BY 4.0

Hyperparameters

Parameter Value
Training method Full fine-tune (no LoRA)
Precision bfloat16
Epochs 7
Learning rate 2e-5 (peak)
LR scheduler Cosine with warmup
Batch size 1 + gradient accumulation 4
Max sequence length 2048
Optimizer AdamW
Hardware Apple M-series, 16 GB unified memory
Training time ~17.6 hours
Best val loss 0.3043 (epoch 7)

Key Training Insight: Sequence Length

Early training runs with max_seq_length=512 truncated 92% of examples, causing the model to learn schema generation (CREATE TABLE) instead of queries — resulting in only ~50% accuracy. Setting max_seq_length=2048 eliminated truncation and improved accuracy from 50% to 68% before augmentation, and to 91% after all training components were applied.

Limitations

Known Issues

  1. iOS CoreData Schemas (84.0%): The Z-prefix column naming convention (e.g., ZISFROMME, ZTIMESTAMP) provides no semantic signal from column names alone, making these schemas harder to reason about.
  2. Hard Queries — 3.7 pp gap to GPT-4o: Complex CTEs, recursive queries, and window functions are the primary remaining challenge.
  3. Finance & Crypto (81.8%, n=11): Small test set; confidence intervals are wide. Interpret with caution.
  4. ~1 in 11 error rate: Approximately 9% of generated queries will contain errors. Expert review of all outputs is required before use in investigations.

When Human Review is Especially Important

  • Complex multi-table queries with CTEs or window functions
  • Case-critical or court-admissible investigations
  • Any query that will be used to draw conclusions about a suspect
  • Queries involving rare or unusual forensic artifact schemas

Evaluation

  • Test set: 100 examples, held-out, seed=42, non-augmented
  • Metric: Execution accuracy — query is correct iff it executes without error AND returns a result set identical to the reference query
  • Reference validation: All reference queries validated for execution correctness before evaluation; 5 broken queries in the test set were corrected
  • Evaluation script: Available in the dataset repository on Zenodo ([DOI])

Citation

If you use this model or the SQLiteDS dataset in your research, please cite:

@article{pawlaszczyk2026forsqlitelm,
  author  = {Dirk Pawlaszczyk},
  title   = {AI-Based Automated SQL Query Generation for SQLite Databases
             in Mobile Forensics},
  journal = {Forensic Science International: Digital Investigation},
  year    = {2026},
  note    = {FSIDI-D-26-00029}
}

License

Apache 2.0 — following the base Llama 3.2 license terms.

Acknowledgments

  • Base model: Meta's Llama 3.2-3B-Instruct
  • Training framework: Hugging Face Transformers
  • Forensic tool integration: FQLite
  • Schema sources: iLEAPP, ALEAPP, Autopsy (used under their respective open-source licenses)

Additional Resources


Disclaimer: ForSQLiteLM is intended for research and forensic practitioner use. All generated SQL queries must be reviewed by a qualified practitioner before execution in live forensic investigations. The authors accept no liability for incorrect conclusions drawn from unvalidated model outputs.

Downloads last month
248
Safetensors
Model size
3B params
Tensor type
F16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for pawlaszc/DigitalForensicsText2SQLite

Quantized
(106)
this model

Dataset used to train pawlaszc/DigitalForensicsText2SQLite

Evaluation results

  • Overall Accuracy (without app name) on SQLiteDS — Mobile Forensics SQL Dataset (corrected)
    self-reported
    91.000
  • Easy Queries Accuracy on SQLiteDS — Mobile Forensics SQL Dataset (corrected)
    self-reported
    95.100
  • Medium Queries Accuracy on SQLiteDS — Mobile Forensics SQL Dataset (corrected)
    self-reported
    87.500
  • Hard Queries Accuracy on SQLiteDS — Mobile Forensics SQL Dataset (corrected)
    self-reported
    88.900