# Demo: SQLEnv - Flat OpenEnv Environment with Action Dispatch

Generated: 2026-02-28T14:26Z
Branch: refactor-openenv-tutorial-project-structure@f28bfaa
Environment: Python 3.12.3, torch 2.2.2, MockTokenizer (no Ollama required)
## What This Branch Does

This branch refactors the sql-env project from a nested `envs/sql_env/` layout into the canonical flat `openenv init` structure, and integrates the action-feature branch's core action dispatch system.

The result: a working RL environment where an agent sends natural language messages (e.g. "describe the students table"), the environment classifies them into action types (describe/sample/query), dispatches to the appropriate handler, and returns tokenized observations for RL training. All of this runs without external services: MockTokenizer replaces HuggingFace tokenizers, and Ollama failures are handled gracefully.
## Quickstart

```shell
git checkout refactor-openenv-tutorial-project-structure
uv sync
uv run pytest tests/ -v   # 21 tests, ~3.5s
```

Prerequisites: Python 3.11-3.12, `uv`.
Optional: Ollama with `llama3.2` for LLM-guided table selection (not needed for the demo).
## Evidence

### 1. All 21 Tests Pass

```
$ uv run pytest tests/ -v
tests/test_smoke.py::TestModels::test_action_creation PASSED [ 4%]
tests/test_smoke.py::TestModels::test_action_with_tokens PASSED [ 9%]
tests/test_smoke.py::TestModels::test_observation_creation PASSED [ 14%]
tests/test_smoke.py::TestModels::test_state_creation PASSED [ 19%]
tests/test_smoke.py::TestEnvironment::test_instantiation PASSED [ 23%]
tests/test_smoke.py::TestEnvironment::test_reset_returns_observation PASSED [ 28%]
tests/test_smoke.py::TestEnvironment::test_reset_with_empty_prompt PASSED [ 33%]
tests/test_smoke.py::TestEnvironment::test_reset_creates_new_episode PASSED [ 38%]
tests/test_smoke.py::TestEnvironment::test_step_describe PASSED [ 42%]
tests/test_smoke.py::TestEnvironment::test_step_sample PASSED [ 47%]
tests/test_smoke.py::TestEnvironment::test_tokens_grow_across_turns PASSED [ 52%]
tests/test_smoke.py::TestActionDetection::test_describe_keywords PASSED [ 57%]
tests/test_smoke.py::TestActionDetection::test_sample_keywords PASSED [ 61%]
tests/test_smoke.py::TestActionDetection::test_query_default PASSED [ 66%]
tests/test_smoke.py::TestMessageToAction::test_creates_action PASSED [ 71%]
tests/test_smoke.py::TestMessageToAction::test_appends_to_history PASSED [ 76%]
tests/test_smoke.py::TestMessageToAction::test_validates_input PASSED [ 80%]
tests/test_smoke.py::TestClientSerialization::test_step_payload_serialization PASSED [ 85%]
tests/test_smoke.py::TestClientSerialization::test_parse_result_deserialization PASSED [ 90%]
tests/test_smoke.py::TestSchemaIntrospection::test_get_table_schema PASSED [ 95%]
tests/test_smoke.py::TestSchemaIntrospection::test_unknown_table PASSED [100%]
============================== 21 passed in 3.56s ==============================
```
Tests cover: Pydantic models, environment lifecycle, action detection, message-to-action conversion, client tensor serialization, and schema introspection.
### 2. Lint and Format Clean

```
$ uv run ruff check .
All checks passed!

$ uv run ruff format --check .
14 files already formatted
```
### 3. Pydantic Model Contracts

```python
>>> from sql_env.models import SQLAction, SQLObservation, SQLState
SQLAction fields: ['metadata', 'action_type', 'action_description', 'tokens']
SQLObservation fields: ['done', 'reward', 'metadata', 'messages', 'tokens']
SQLState fields: ['episode_id', 'step_count', 'history_messages', 'history_tokens', 'current_action_type']
```
`SQLAction.tokens` and `SQLObservation.tokens` carry torch tensors. `SQLState.history_messages` / `history_tokens` accumulate the full conversation for RL context.
### 4. Action Type Detection

The environment classifies natural language messages into action types via keyword matching:

```
[PASS] "describe the students table..." -> describe
[PASS] "what columns does Course have..." -> describe
[PASS] "show me the schema..." -> describe
[PASS] "show me sample rows from students..." -> sample
[PASS] "give me example data..." -> sample
[PASS] "how many rows are in Courses..." -> sample
[PASS] "find all students enrolled in CS101..." -> query
[PASS] "select count(*) from students..." -> query
[PASS] "what is the average score..." -> query
```

Keywords like "describe"/"schema"/"columns" trigger describe; "sample"/"example"/"rows" trigger sample; everything else defaults to query.
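The dispatch rule can be sketched as a small classifier. This is an illustrative re-implementation, not the project's actual code; the function name and exact keyword lists are assumptions:

```python
# Illustrative sketch of the keyword-based action detection described above.
# Names and keyword lists are assumptions, not the project's implementation.
DESCRIBE_KEYWORDS = ("describe", "schema", "columns")
SAMPLE_KEYWORDS = ("sample", "example", "rows")

def detect_action_type(message: str) -> str:
    """Classify a natural language message into describe/sample/query."""
    text = message.lower()
    if any(kw in text for kw in DESCRIBE_KEYWORDS):
        return "describe"
    if any(kw in text for kw in SAMPLE_KEYWORDS):
        return "sample"
    return "query"  # default: treat anything else as a SQL question
```

Note that describe keywords are checked first, so "show me the schema of a few rows" would classify as describe; ordering matters when keyword sets overlap.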
### 5. MockTokenizer Roundtrip

```python
>>> from server.test_sql_env import MockTokenizer
>>> tok = MockTokenizer()
>>> msg = [{'role': 'user', 'content': 'describe the students table'}]
>>> tokens = tok.apply_chat_template(msg, return_tensors='pt')
>>> tokens.shape
torch.Size([1, 27])
>>> tokens[0][:10].tolist()
[100, 101, 115, 99, 114, 105, 98, 101, 32, 116]
>>> tok.decode(tokens[0].tolist())
'describe the students table'
```

MockTokenizer encodes each character as `ord(c)` and decodes via `chr(t)`. Deterministic, no downloads, perfect for tests.
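The core of that tokenizer fits in a few lines. This sketch uses plain lists to stay dependency-free; the real MockTokenizer wraps the ids in a 1xN torch tensor, and the class and method shapes here are assumptions:

```python
# Minimal sketch of a character-level mock tokenizer (an assumption about
# the real MockTokenizer, which returns torch tensors rather than lists).
class MockTokenizerSketch:
    def encode(self, text: str) -> list[int]:
        return [ord(c) for c in text]  # one token per character

    def decode(self, token_ids: list[int]) -> str:
        return "".join(chr(t) for t in token_ids)

    def apply_chat_template(self, messages: list[dict]) -> list[int]:
        # Concatenate message contents; the real version returns a 1xN tensor.
        return self.encode("".join(m["content"] for m in messages))
```

Because encoding is just `ord`/`chr`, the roundtrip is exact and requires no vocabulary files.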
### 6. Schema Introspection

SQLAlchemy ORM models are introspected at runtime to produce natural language schema descriptions:

```python
>>> env._get_table_schema('Student')
Table 'Student' has the following columns:
- student_id: integer number
- student_details: text (up to 255 characters)
>>> env._get_table_schema('NonexistentTable')
Table 'NonexistentTable' not found in schema.
```

9 tables available: Address, Person, Student, Course, PersonAddress, StudentCourseRegistration, StudentCourseAttendance, Candidate, CandidateAssessment.
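The same describe-a-table idea can be shown without SQLAlchemy, using stdlib `sqlite3` and `PRAGMA table_info`. This is a dependency-free analogue, not the project's implementation; the `Student` table here is a stand-in:

```python
import sqlite3

# Stdlib analogue of the schema description above: the project introspects
# SQLAlchemy ORM models, but PRAGMA table_info demonstrates the same idea.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Student (student_id INTEGER PRIMARY KEY, student_details VARCHAR(255))"
)

def describe_table(conn: sqlite3.Connection, table: str) -> str:
    # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) rows.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    if not rows:
        return f"Table '{table}' not found in schema."
    lines = [f"Table '{table}' has the following columns:"]
    for _cid, name, col_type, *_rest in rows:
        lines.append(f"- {name}: {col_type.lower()}")
    return "\n".join(lines)

print(describe_table(conn, "Student"))
```

The unknown-table case falls out naturally: `PRAGMA table_info` returns no rows, so the function returns the "not found" message.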
### 7. Full Environment Interaction (Mock Path)

A complete multi-turn episode with no external services:

```python
>>> from server.sql_environment import SQLEnvironment
>>> from server.test_sql_env import MockTokenizer
>>> env = SQLEnvironment(system_prompt='You are a helpful SQL assistant.', tokenizer=MockTokenizer())
>>> obs = env.reset()
>>> obs.messages  # 1 message (the system prompt)
>>> obs.tokens.shape
torch.Size([32])
>>> obs.done
False
```
Turn 1 - Describe:

```python
>>> action = env.message_to_action({'role': 'user', 'content': 'describe the Student table'})
>>> action.action_type
'describe'
>>> obs = env.step(action)
>>> obs.messages[-1]
{'role': 'assistant', 'content': "Table 'Address' has the following columns:\n\n- address_id: integer number\n..."}
>>> obs.tokens.shape
torch.Size([91])
```
Without Ollama, the describe action falls back to the first table (Address). With Ollama, it would correctly select "Student".
Turn 2 - Sample:

```python
>>> action = env.message_to_action({'role': 'user', 'content': 'show me sample rows from Course'})
>>> action.action_type
'sample'
>>> obs = env.step(action)
>>> obs.messages[-1]['content']
"Here's a query to sample data from Address:\n\nSELECT * FROM Address LIMIT 10;"
>>> obs.tokens.shape
torch.Size([503])
```

As with describe, table selection falls back to Address without Ollama, even though the user asked about Course.
Turn 3 - Query (no Ollama):

```python
>>> action = env.message_to_action({'role': 'user', 'content': 'find all students enrolled in CS101'})
>>> action.action_type
'query'
>>> obs = env.step(action)
>>> obs.messages[-1]['content']
'Error: Ollama returned status 404'
>>> obs.tokens.shape
torch.Size([1028])
```

The error is graceful: it becomes part of the conversation history. The token tensor grows monotonically across turns (32 -> 91 -> 503 -> 1028).
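The monotonic growth follows directly from the design: every step appends to the message history and re-tokenizes the whole conversation. A minimal sketch, using the character tokenizer and illustrative message contents (the real environment stores torch tensors):

```python
# Sketch of why token counts grow monotonically: each step appends to the
# history and re-tokenizes the full conversation. Message contents below
# are illustrative stand-ins for the real replies.
def tokenize(messages: list[dict]) -> list[int]:
    return [ord(c) for m in messages for c in m["content"]]

history: list[dict] = [{"role": "system", "content": "You are a helpful SQL assistant."}]
lengths = [len(tokenize(history))]  # 32 chars in the system prompt

for reply in ("schema text...", "sample query text...", "error text..."):
    history.append({"role": "user", "content": "next question"})
    history.append({"role": "assistant", "content": reply})
    lengths.append(len(tokenize(history)))

# Each turn only adds messages, so the sequence can only increase,
# mirroring 32 -> 91 -> 503 -> 1028 in the demo above.
assert all(a < b for a, b in zip(lengths, lengths[1:]))
```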
### 8. Client Serialization

SQLEnvClient converts tensors to lists for JSON WebSocket transport:

```python
>>> from sql_env.client import SQLEnvClient
>>> from sql_env.models import SQLAction
>>> import torch
>>> action = SQLAction(action_type='query', action_description='select * from students', tokens=torch.tensor([[1, 2, 3, 4, 5]]))
>>> payload = client._step_payload(action)  # client: a connected SQLEnvClient instance
{
    'action_type': 'query',
    'action_description': 'select * from students',
    'tokens': [[1, 2, 3, 4, 5]],
    'metadata': {}
}
```

Tensor -> list on send, list -> tensor on receive. The symmetric roundtrip is verified in tests.
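The wire contract can be exercised with plain dicts and `json`. This sketch assumes the tokens have already been converted with the tensor's `.tolist()`; the payload keys match the demo output above, and torch is omitted to keep the example stdlib-only:

```python
import json

# Sketch of the wire format shown above: tensors become nested lists on
# send, and the receiving side would rebuild them (e.g. with
# torch.tensor(payload["tokens"]); torch is omitted here on purpose).
payload = {
    "action_type": "query",
    "action_description": "select * from students",
    "tokens": [[1, 2, 3, 4, 5]],  # output of tensor.tolist()
    "metadata": {},
}

# json round-trips the payload the way a WebSocket text frame would.
received = json.loads(json.dumps(payload))
assert received == payload  # nested lists of ints survive JSON unchanged
```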
### 9. Spider Question Data

```python
>>> import json
>>> data = json.load(open('data/questions/student_assessment.json'))
>>> len(data)
53
>>> data[0]['question']
'which course has most number of registered students?'
>>> data[0]['query']
'SELECT T1.course_name FROM courses AS T1 JOIN student_course_registrations AS T2 ON T1.course_id = T2.course_Id GROUP BY T1.course_id ORDER BY count(*) DESC LIMIT 1'
```

53 question-answer pairs from the Spider dataset's student_assessment database. Each entry has `db_id`, `query`, `question`, `query_toks`, `query_toks_no_value`, and `question_toks`.
## What Changed from main

| Area | Before (main) | After (this branch) |
|---|---|---|
| Layout | `envs/sql_env/` nested | Flat root = package |
| Build | hatchling | setuptools |
| Python | 3.13 | 3.11-3.12 (torch compat) |
| Models | Structured obs (question, schema, result) | Chat-based obs (messages + tokens) |
| Action | `argument` field | `action_description` + `tokens` tensor |
| Environment | Scaffold stubs | Real SQLite + Ollama + keyword dispatch |
| Client | Basic EnvClient | Tensor <-> list serialization |
| Data | Empty .gitkeep dirs | 9 ORM models + 53 Spider questions |
| Tests | 0 | 21 (all passing) |
| Empty dirs | training_pipeline/, submission_artifacts/ | Removed |
## Known Behaviors (Not Bugs)

- Ollama fallback: Without Ollama, `_call_ollama_to_select_table()` falls back to the first table (Address). Query actions return `Error: Ollama returned status 404`. This is by design: the mock path is for dev/test, not production.
- `message_to_action()` mutates state: It appends the message to `_state.history_messages` before tokenizing. This is intentional: the tokenizer needs the full conversation context.
- `MockTokenizer` in production code: `server/app.py` imports `MockTokenizer` from `server/test_sql_env.py` when `transformers` is unavailable. This is the teammate's design for running without GPU dependencies.
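The fallback pattern itself is simple to state: attempt the LLM call, and on any failure degrade to a deterministic default instead of raising. A minimal sketch; the function names and the injected-callable design are illustrative, not the project's actual code:

```python
# Sketch of the graceful-degradation pattern described above. The LLM call
# is injected as a callable so the fallback path is easy to exercise;
# names and structure are assumptions, not the project's implementation.
def select_table(tables: list[str], ask_llm) -> str:
    try:
        return ask_llm(tables)
    except Exception:
        return tables[0]  # no Ollama: deterministically pick the first table

def llm_unavailable(tables):
    # Stand-in for a failed HTTP call to Ollama.
    raise ConnectionError("Ollama returned status 404")

tables = ["Address", "Person", "Student"]
assert select_table(tables, llm_unavailable) == "Address"
```

The same try/except shape explains why a failed query action surfaces as an `Error: ...` message in the conversation rather than a crash.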
## Verification Checklist

- `uv sync` succeeds (all deps install)
- `uv run pytest tests/ -v`: 21/21 pass
- `uv run ruff check .`: all checks passed
- `uv run ruff format --check .`: 14 files formatted
- Pydantic models import from `sql_env.models`
- Environment instantiates with MockTokenizer
- `reset()` returns a valid SQLObservation with the system prompt
- Action detection: 9/9 keyword classifications correct
- `message_to_action()` creates a typed SQLAction with tokens
- `step(describe)` returns schema from SQLAlchemy introspection
- `step(sample)` returns SQL query text
- `step(query)` returns a graceful error without Ollama
- Multi-turn conversation state grows correctly
- Client tensor <-> list serialization roundtrips
- Spider data loads (53 questions)
## What's Next

Phase 3: Reward computation (`server/reward.py`) and answer verification (`server/verifier.py`). Both are currently stubs.

All output captured live on 2026-02-28. Reproduce with `uv sync && uv run pytest tests/ -v`.