# Demo: SQLEnv - Flat OpenEnv Environment with Action Dispatch

Generated: 2026-02-28T14:26Z
Branch: refactor-openenv-tutorial-project-structure@f28bfaa
Environment: Python 3.12.3, torch 2.2.2, MockTokenizer (no Ollama required)
## What This Branch Does

This branch refactors the sql-env project from a nested `envs/sql_env/` layout into the canonical flat `openenv init` structure, and integrates the action-feature branch's core action dispatch system.

The result: a working RL environment where an agent sends natural language messages (e.g. "describe the students table"), the environment classifies them into action types (describe/sample/query), dispatches to the appropriate handler, and returns tokenized observations for RL training. All of this runs without external services: MockTokenizer replaces HuggingFace tokenizers, and Ollama failures are handled gracefully.
## Quickstart

```shell
git checkout refactor-openenv-tutorial-project-structure
uv sync
uv run pytest tests/ -v   # 21 tests, ~3.5s
```

Prerequisites: Python 3.11-3.12, `uv`.
Optional: Ollama with `llama3.2` for LLM-guided table selection (not needed for the demo).
## Evidence

### 1. All 21 Tests Pass

```
$ uv run pytest tests/ -v
tests/test_smoke.py::TestModels::test_action_creation PASSED [ 4%]
tests/test_smoke.py::TestModels::test_action_with_tokens PASSED [ 9%]
tests/test_smoke.py::TestModels::test_observation_creation PASSED [ 14%]
tests/test_smoke.py::TestModels::test_state_creation PASSED [ 19%]
tests/test_smoke.py::TestEnvironment::test_instantiation PASSED [ 23%]
tests/test_smoke.py::TestEnvironment::test_reset_returns_observation PASSED [ 28%]
tests/test_smoke.py::TestEnvironment::test_reset_with_empty_prompt PASSED [ 33%]
tests/test_smoke.py::TestEnvironment::test_reset_creates_new_episode PASSED [ 38%]
tests/test_smoke.py::TestEnvironment::test_step_describe PASSED [ 42%]
tests/test_smoke.py::TestEnvironment::test_step_sample PASSED [ 47%]
tests/test_smoke.py::TestEnvironment::test_tokens_grow_across_turns PASSED [ 52%]
tests/test_smoke.py::TestActionDetection::test_describe_keywords PASSED [ 57%]
tests/test_smoke.py::TestActionDetection::test_sample_keywords PASSED [ 61%]
tests/test_smoke.py::TestActionDetection::test_query_default PASSED [ 66%]
tests/test_smoke.py::TestMessageToAction::test_creates_action PASSED [ 71%]
tests/test_smoke.py::TestMessageToAction::test_appends_to_history PASSED [ 76%]
tests/test_smoke.py::TestMessageToAction::test_validates_input PASSED [ 80%]
tests/test_smoke.py::TestClientSerialization::test_step_payload_serialization PASSED [ 85%]
tests/test_smoke.py::TestClientSerialization::test_parse_result_deserialization PASSED [ 90%]
tests/test_smoke.py::TestSchemaIntrospection::test_get_table_schema PASSED [ 95%]
tests/test_smoke.py::TestSchemaIntrospection::test_unknown_table PASSED [100%]
============================== 21 passed in 3.56s ==============================
```
Tests cover: Pydantic models, environment lifecycle, action detection, message-to-action conversion, client tensor serialization, and schema introspection.
### 2. Lint and Format Clean

```
$ uv run ruff check .
All checks passed!

$ uv run ruff format --check .
14 files already formatted
```
### 3. Pydantic Model Contracts

```python
>>> from sql_env.models import SQLAction, SQLObservation, SQLState
SQLAction fields: ['metadata', 'action_type', 'action_description', 'tokens']
SQLObservation fields: ['done', 'reward', 'metadata', 'messages', 'tokens']
SQLState fields: ['episode_id', 'step_count', 'history_messages', 'history_tokens', 'current_action_type']
```
`SQLAction.tokens` and `SQLObservation.tokens` carry torch tensors. `SQLState.history_messages` / `history_tokens` accumulate the full conversation for RL context.
### 4. Action Type Detection

The environment classifies natural language messages into action types via keyword matching:

```
[PASS] "describe the students table..." -> describe
[PASS] "what columns does Course have..." -> describe
[PASS] "show me the schema..." -> describe
[PASS] "show me sample rows from students..." -> sample
[PASS] "give me example data..." -> sample
[PASS] "how many rows are in Courses..." -> sample
[PASS] "find all students enrolled in CS101..." -> query
[PASS] "select count(*) from students..." -> query
[PASS] "what is the average score..." -> query
```

Keywords like "describe"/"schema"/"columns" trigger describe; "sample"/"example"/"rows" trigger sample; everything else defaults to query.
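The dispatch rule can be sketched as a small classifier. This is an illustrative re-implementation, not the project's actual code; the function name and exact keyword lists are assumptions:

```python
# Illustrative sketch of the keyword-based action detection described above.
# Names and keyword lists are assumptions, not the project's implementation.
DESCRIBE_KEYWORDS = ("describe", "schema", "columns")
SAMPLE_KEYWORDS = ("sample", "example", "rows")

def detect_action_type(message: str) -> str:
    """Classify a natural language message into describe/sample/query."""
    text = message.lower()
    if any(kw in text for kw in DESCRIBE_KEYWORDS):
        return "describe"
    if any(kw in text for kw in SAMPLE_KEYWORDS):
        return "sample"
    return "query"  # default: treat anything else as a SQL question
```

Note that describe keywords are checked first, so "show me the schema of a few rows" would classify as describe; ordering matters when keyword sets overlap.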
### 5. MockTokenizer Roundtrip

```python
>>> from server.test_sql_env import MockTokenizer
>>> tok = MockTokenizer()
>>> msg = [{'role': 'user', 'content': 'describe the students table'}]
>>> tokens = tok.apply_chat_template(msg, return_tensors='pt')
>>> tokens.shape
torch.Size([1, 27])
>>> tokens[0][:10].tolist()
[100, 101, 115, 99, 114, 105, 98, 101, 32, 116]
>>> tok.decode(tokens[0].tolist())
'describe the students table'
```

MockTokenizer encodes each character as `ord(c)` and decodes via `chr(t)`. Deterministic, no downloads, perfect for tests.
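The core of that tokenizer fits in a few lines. This sketch uses plain lists to stay dependency-free; the real MockTokenizer wraps the ids in a 1xN torch tensor, and the class and method shapes here are assumptions:

```python
# Minimal sketch of a character-level mock tokenizer (an assumption about
# the real MockTokenizer, which returns torch tensors rather than lists).
class MockTokenizerSketch:
    def encode(self, text: str) -> list[int]:
        return [ord(c) for c in text]  # one token per character

    def decode(self, token_ids: list[int]) -> str:
        return "".join(chr(t) for t in token_ids)

    def apply_chat_template(self, messages: list[dict]) -> list[int]:
        # Concatenate message contents; the real version returns a 1xN tensor.
        return self.encode("".join(m["content"] for m in messages))
```

Because encoding is just `ord`/`chr`, the roundtrip is exact and requires no vocabulary files.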
### 6. Schema Introspection

SQLAlchemy ORM models are introspected at runtime to produce natural language schema descriptions:

```python
>>> env._get_table_schema('Student')
Table 'Student' has the following columns:
- student_id: integer number
- student_details: text (up to 255 characters)
>>> env._get_table_schema('NonexistentTable')
Table 'NonexistentTable' not found in schema.
```

9 tables available: Address, Person, Student, Course, PersonAddress, StudentCourseRegistration, StudentCourseAttendance, Candidate, CandidateAssessment.
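The same describe-a-table idea can be shown without SQLAlchemy, using stdlib `sqlite3` and `PRAGMA table_info`. This is a dependency-free analogue, not the project's implementation; the `Student` table here is a stand-in:

```python
import sqlite3

# Stdlib analogue of the schema description above: the project introspects
# SQLAlchemy ORM models, but PRAGMA table_info demonstrates the same idea.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Student (student_id INTEGER PRIMARY KEY, student_details VARCHAR(255))"
)

def describe_table(conn: sqlite3.Connection, table: str) -> str:
    # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) rows.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    if not rows:
        return f"Table '{table}' not found in schema."
    lines = [f"Table '{table}' has the following columns:"]
    for _cid, name, col_type, *_rest in rows:
        lines.append(f"- {name}: {col_type.lower()}")
    return "\n".join(lines)

print(describe_table(conn, "Student"))
```

The unknown-table case falls out naturally: `PRAGMA table_info` returns no rows, so the function returns the "not found" message.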
### 7. Full Environment Interaction (Mock Path)

A complete multi-turn episode with no external services:

```python
>>> from server.sql_environment import SQLEnvironment
>>> from server.test_sql_env import MockTokenizer
>>> env = SQLEnvironment(system_prompt='You are a helpful SQL assistant.', tokenizer=MockTokenizer())
>>> obs = env.reset()
>>> obs.messages  # 1 message (the system prompt)
>>> obs.tokens.shape
torch.Size([32])
>>> obs.done
False
```
Turn 1 - Describe:

```python
>>> action = env.message_to_action({'role': 'user', 'content': 'describe the Student table'})
>>> action.action_type
'describe'
>>> obs = env.step(action)
>>> obs.messages[-1]
{'role': 'assistant', 'content': "Table 'Address' has the following columns:\n\n- address_id: integer number\n..."}
>>> obs.tokens.shape
torch.Size([91])
```
Without Ollama, the describe action falls back to the first table (Address). With Ollama, it would correctly select "Student".
Turn 2 - Sample:

```python
>>> action = env.message_to_action({'role': 'user', 'content': 'show me sample rows from Course'})
>>> action.action_type
'sample'
>>> obs = env.step(action)
>>> obs.messages[-1]['content']
"Here's a query to sample data from Address:\n\nSELECT * FROM Address LIMIT 10;"
>>> obs.tokens.shape
torch.Size([503])
```

As with describe, table selection falls back to Address without Ollama, even though the user asked about Course.
Turn 3 - Query (no Ollama):

```python
>>> action = env.message_to_action({'role': 'user', 'content': 'find all students enrolled in CS101'})
>>> action.action_type
'query'
>>> obs = env.step(action)
>>> obs.messages[-1]['content']
'Error: Ollama returned status 404'
>>> obs.tokens.shape
torch.Size([1028])
```

The error is graceful: it becomes part of the conversation history. The token tensor grows monotonically across turns (32 -> 91 -> 503 -> 1028).
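The monotonic growth follows directly from the design: every step appends to the message history and re-tokenizes the whole conversation. A minimal sketch, using the character tokenizer and illustrative message contents (the real environment stores torch tensors):

```python
# Sketch of why token counts grow monotonically: each step appends to the
# history and re-tokenizes the full conversation. Message contents below
# are illustrative stand-ins for the real replies.
def tokenize(messages: list[dict]) -> list[int]:
    return [ord(c) for m in messages for c in m["content"]]

history: list[dict] = [{"role": "system", "content": "You are a helpful SQL assistant."}]
lengths = [len(tokenize(history))]  # 32 chars in the system prompt

for reply in ("schema text...", "sample query text...", "error text..."):
    history.append({"role": "user", "content": "next question"})
    history.append({"role": "assistant", "content": reply})
    lengths.append(len(tokenize(history)))

# Each turn only adds messages, so the sequence can only increase,
# mirroring 32 -> 91 -> 503 -> 1028 in the demo above.
assert all(a < b for a, b in zip(lengths, lengths[1:]))
```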
### 8. Client Serialization

SQLEnvClient converts tensors to lists for JSON WebSocket transport:

```python
>>> from sql_env.client import SQLEnvClient
>>> from sql_env.models import SQLAction
>>> import torch
>>> action = SQLAction(action_type='query', action_description='select * from students', tokens=torch.tensor([[1, 2, 3, 4, 5]]))
>>> payload = client._step_payload(action)  # client: a connected SQLEnvClient instance
{
    'action_type': 'query',
    'action_description': 'select * from students',
    'tokens': [[1, 2, 3, 4, 5]],
    'metadata': {}
}
```

Tensor -> list on send, list -> tensor on receive. The symmetric roundtrip is verified in tests.
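The wire contract can be exercised with plain dicts and `json`. This sketch assumes the tokens have already been converted with the tensor's `.tolist()`; the payload keys match the demo output above, and torch is omitted to keep the example stdlib-only:

```python
import json

# Sketch of the wire format shown above: tensors become nested lists on
# send, and the receiving side would rebuild them (e.g. with
# torch.tensor(payload["tokens"]); torch is omitted here on purpose).
payload = {
    "action_type": "query",
    "action_description": "select * from students",
    "tokens": [[1, 2, 3, 4, 5]],  # output of tensor.tolist()
    "metadata": {},
}

# json round-trips the payload the way a WebSocket text frame would.
received = json.loads(json.dumps(payload))
assert received == payload  # nested lists of ints survive JSON unchanged
```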
### 9. Spider Question Data

```python
>>> import json
>>> data = json.load(open('data/questions/student_assessment.json'))
>>> len(data)
53
>>> data[0]['question']
'which course has most number of registered students?'
>>> data[0]['query']
'SELECT T1.course_name FROM courses AS T1 JOIN student_course_registrations AS T2 ON T1.course_id = T2.course_Id GROUP BY T1.course_id ORDER BY count(*) DESC LIMIT 1'
```

53 question-answer pairs from the Spider dataset's student_assessment database. Each entry has `db_id`, `query`, `question`, `query_toks`, `query_toks_no_value`, and `question_toks`.
## What Changed from main

| Area | Before (main) | After (this branch) |
|---|---|---|
| Layout | `envs/sql_env/` nested | Flat root = package |
| Build | hatchling | setuptools |
| Python | 3.13 | 3.11-3.12 (torch compat) |
| Models | Structured obs (question, schema, result) | Chat-based obs (messages + tokens) |
| Action | `argument` field | `action_description` + `tokens` tensor |
| Environment | Scaffold stubs | Real SQLite + Ollama + keyword dispatch |
| Client | Basic EnvClient | Tensor <-> list serialization |
| Data | Empty .gitkeep dirs | 9 ORM models + 53 Spider questions |
| Tests | 0 | 21 (all passing) |
| Empty dirs | training_pipeline/, submission_artifacts/ | Removed |
## Known Behaviors (Not Bugs)

- Ollama fallback: Without Ollama, `_call_ollama_to_select_table()` falls back to the first table (Address). Query actions return `Error: Ollama returned status 404`. This is by design: the mock path is for dev/test, not production.
- `message_to_action()` mutates state: It appends the message to `_state.history_messages` before tokenizing. This is intentional: the tokenizer needs the full conversation context.
- `MockTokenizer` in production code: `server/app.py` imports `MockTokenizer` from `server/test_sql_env.py` when `transformers` is unavailable. This is the teammate's design for running without GPU dependencies.
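The fallback pattern itself is simple to state: attempt the LLM call, and on any failure degrade to a deterministic default instead of raising. A minimal sketch; the function names and the injected-callable design are illustrative, not the project's actual code:

```python
# Sketch of the graceful-degradation pattern described above. The LLM call
# is injected as a callable so the fallback path is easy to exercise;
# names and structure are assumptions, not the project's implementation.
def select_table(tables: list[str], ask_llm) -> str:
    try:
        return ask_llm(tables)
    except Exception:
        return tables[0]  # no Ollama: deterministically pick the first table

def llm_unavailable(tables):
    # Stand-in for a failed HTTP call to Ollama.
    raise ConnectionError("Ollama returned status 404")

tables = ["Address", "Person", "Student"]
assert select_table(tables, llm_unavailable) == "Address"
```

The same try/except shape explains why a failed query action surfaces as an `Error: ...` message in the conversation rather than a crash.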
## Verification Checklist

- `uv sync` succeeds (all deps install)
- `uv run pytest tests/ -v`: 21/21 pass
- `uv run ruff check .`: all checks passed
- `uv run ruff format --check .`: 14 files formatted
- Pydantic models import from `sql_env.models`
- Environment instantiates with MockTokenizer
- `reset()` returns a valid SQLObservation with the system prompt
- Action detection: 9/9 keyword classifications correct
- `message_to_action()` creates a typed SQLAction with tokens
- `step(describe)` returns schema from SQLAlchemy introspection
- `step(sample)` returns SQL query text
- `step(query)` returns a graceful error without Ollama
- Multi-turn conversation state grows correctly
- Client tensor <-> list serialization roundtrips
- Spider data loads (53 questions)
## What's Next

Phase 3: Reward computation (`server/reward.py`) and answer verification (`server/verifier.py`). Both are currently stubs.

All output captured live on 2026-02-28. Reproduce with `uv sync && uv run pytest tests/ -v`.