| --- |
| title: AI Math Question Classifier & Solver |
| emoji: 🧮 |
| colorFrom: blue |
| colorTo: purple |
| sdk: docker |
| app_file: app.py |
| pinned: false |
| license: mit |
| tags: |
| - text-classification |
| - mathematics |
| - education |
| - machine-learning |
| - nlp |
| - tfidf |
| - ensemble-methods |
| - gemini |
| --- |
| |
| # 🧮 AI Math Question Classifier & Solver |
|
|
| <div align="center"> |
|
|
|
|
| **An intelligent system for automated mathematical question classification with AI-powered step-by-step solutions** |
|
|
| [Try Demo](https://huggingface.co/spaces/NeerajCodz/aiMathQuestionClassification) • [Report Bug](#contact) • [Request Feature](#contact) |
|
|
| </div> |
|
|
| --- |
|
|
| ## Table of Contents |
|
|
| - [Abstract](#abstract) |
| - [Problem Statement](#problem-statement) |
| - [System Architecture](#system-architecture) |
| - [Dataset](#dataset) |
| - [Methodology](#methodology) |
| - [Experimental Results](#experimental-results) |
| - [Design Decisions & Ablation Studies](#design-decisions--ablation-studies) |
| - [Deployment Architecture](#deployment-architecture) |
| - [Usage](#usage) |
| - [Future Work](#future-work) |
| - [Citation](#citation) |
|
|
| --- |
|
|
| ## Abstract |
|
|
| This work presents an end-to-end system for automated classification of mathematical questions into domain-specific categories (Algebra, Counting & Probability, Geometry, Intermediate Algebra, Number Theory, Precalculus, Prealgebra) using ensemble machine learning methods combined with AI-powered solution generation. The system achieves a **70.40% weighted F1-score** and **70.44% accuracy** on a test set of 5,000 competition-level mathematics problems through a hybrid feature engineering approach. |
|
|
| **Key Contributions:** |
| 1. Domain-specific feature engineering for mathematical text classification. |
| 2. Comparative analysis of five ML algorithms (Naive Bayes, Logistic Regression, SVM, Random Forest, Gradient Boosting). |
| 3. **No F1 Tuning**: Models are trained with fixed hyperparameters and no F1-specific tuning, keeping a baseline configuration as required by the project constraints. |
| 4. Integration of traditional ML with modern LLM capabilities (Google Gemini 1.5-Flash). |
| 5. Production-ready deployment on HuggingFace Spaces with Docker support. |
|
|
| --- |
|
|
| ## Features |
|
|
| - **Real-time Classification**: Instantly categorizes math problems into topics (Algebra, Geometry, Number Theory, etc.) |
| - **Probability Scores**: Shows confidence levels for each predicted category with color-coded visualization |
| - **AI-Powered Solutions**: Integration with Google Gemini 1.5-Flash for detailed step-by-step solutions |
| - **LaTeX Support**: Proper rendering of mathematical notation and equations |
| - **Comprehensive Documentation**: Detailed insights into model training methodology and analytics |
| - **Docker Ready**: Fully containerized for easy deployment on any platform |
| - **HuggingFace Compatible**: Deploy directly to HuggingFace Spaces with one click |
|
|
| --- |
|
|
| ## Problem Statement |
|
|
| ### Research Question |
| *How can we automatically categorize mathematical problems into their respective domains while maintaining high accuracy across diverse problem types and difficulty levels?* |
|
|
| ### Challenges Addressed |
|
|
| 1. **Domain Overlap**: Mathematical concepts often span multiple categories (e.g., precalculus problems involving algebraic manipulation) |
|
|
| 2. **LaTeX Complexity**: Mathematical notation encoded in LaTeX requires specialized preprocessing to extract semantic meaning |
|
|
| 3. **Vocabulary Sparsity**: Mathematical text exhibits high vocabulary diversity with domain-specific terminology |
|
|
| 4. **Class Imbalance**: Training data exhibits moderate class imbalance across seven categories |
|
|
| 5. **Interpretability**: Educational applications require explainable predictions to guide students |
|
|
| ### Applications |
|
|
| - **Adaptive Learning Systems**: Route students to appropriate learning materials based on problem classification |
| - **Automated Assessment**: Categorize student submissions for grading and feedback |
| - **Content Organization**: Organize problem banks in educational platforms |
| - **Difficulty Estimation**: Classification accuracy correlates with problem difficulty |
|
|
| --- |
|
|
| ## System Architecture |
|
|
| ``` |
| ┌─────────────────────────────────────────────────────────────────┐ |
| │                      User Interface Layer                       │ |
| │                    (Gradio Web Application)                     │ |
| └────────────────────────────┬────────────────────────────────────┘ |
|                              │ |
|         ┌────────────────────┴────────────────────┐ |
|         │                                         │ |
|         ▼                                         ▼ |
| ┌───────────────────┐                    ┌──────────────────┐ |
| │  Classification   │                    │    Solution      │ |
| │    Pipeline       │                    │   Generation     │ |
| │                   │                    │  (Gemini 1.5)    │ |
| │ 1. Preprocessing  │                    └──────────────────┘ |
| │ 2. Feature Extract│ |
| │ 3. Vectorization  │ |
| │ 4. Prediction     │ |
| │ 5. Probability    │ |
| └───────────────────┘ |
|         │ |
|         ▼ |
| ┌─────────────────────────────────────┐ |
| │           Model Ensemble            │ |
| │  ┌─────────────────────────────┐    │ |
| │  │  Gradient Boosting (Best)   │    │ |
| │  │  F1-Score: 0.7040           │    │ |
| │  └─────────────────────────────┘    │ |
| └─────────────────────────────────────┘ |
| ``` |
|
|
| --- |
|
|
| ## Dataset |
|
|
| ### MATH Dataset (Hendrycks et al., 2021) |
|
|
| **Source**: [MATH Dataset](https://github.com/hendrycks/math) - A dataset of 12,500 challenging competition mathematics problems |
|
|
| **Statistics:** |
| - **Training Set**: 7,500 problems |
| - **Test Set**: 5,000 problems |
| - **Categories**: 7 (Algebra, Counting & Probability, Geometry, Intermediate Algebra, Number Theory, Prealgebra, Precalculus) |
| - **Format**: JSON with problem text, solution, and difficulty level |
|
|
| **Class Distribution:** |
|
|
| | Topic | Train | Test | % Train | % Test | |
| |--------------------------|--------|-------|---------|--------| |
| | Precalculus | 1,428 | 546 | 19.0% | 10.9% | |
| | Prealgebra | 1,375 | 871 | 18.3% | 17.4% | |
| | Intermediate Algebra | 1,211 | 903 | 16.1% | 18.1% | |
| | Algebra | 1,187 | 1,187 | 15.8% | 23.7% | |
| | Geometry | 956 | 479 | 12.7% | 9.6% | |
| | Number Theory | 869 | 540 | 11.6% | 10.8% | |
| | Counting & Probability | 474 | 474 | 6.3% | 9.5% | |
|
|
|  |
|
|
| **Data Processing:** |
| 1. JSON → Parquet conversion for 10-100x faster I/O |
| 2. Train/test split preserved from original dataset |
| 3. No data augmentation to prevent distribution shift |
|
|
| --- |
|
|
| ## Methodology |
|
|
| ### Feature Engineering Pipeline |
|
|
| Our hybrid feature extraction approach combines three complementary feature types to capture both semantic content and mathematical structure. |
|
|
| #### 1. Text Features (TF-IDF Vectorization) |
|
|
| **Configuration:** |
| ```python |
| from sklearn.feature_extraction.text import TfidfVectorizer |
|
| TfidfVectorizer( |
|     max_features=5000,    # Vocabulary size |
|     ngram_range=(1, 3),   # Unigrams, bigrams, trigrams |
|     min_df=2,             # Ignore terms in < 2 documents |
|     max_df=0.95,          # Ignore terms in > 95% of documents |
|     sublinear_tf=True     # Apply log scaling: 1 + log(tf) |
| ) |
| ``` |
|
|
| **Rationale:** |
| - **N-gram Range (1,3)**: Captures multi-word mathematical expressions (e.g., "find the derivative", "pythagorean theorem") |
| - **min_df=2**: Removes terms that occur in only one document, which reduces noise |
| - **max_df=0.95**: Filters terms that occur in more than 95% of documents (stop words and domain-general terms) |
| - **sublinear_tf**: Dampens effect of high-frequency terms, improves generalization |
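
To make the `sublinear_tf` setting concrete, the sketch below computes the unnormalized weight a term receives when log-scaled term frequency is combined with scikit-learn's default smoothed IDF. The helper names and the example counts (a term found in 150 of 7,500 documents) are illustrative assumptions, not project code, and the per-document L2 normalization that `TfidfVectorizer` applies afterwards is left out.

```python
import math


def sublinear_tf(tf: int) -> float:
    """Log-scaled term frequency used when sublinear_tf=True: 1 + ln(tf)."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0


def smooth_idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed IDF (scikit-learn default): ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0


# Illustrative numbers: 7,500 training problems, a term found in 150 of them.
n_docs, doc_freq = 7500, 150
for tf in (1, 10, 100):
    weight = sublinear_tf(tf) * smooth_idf(n_docs, doc_freq)
    print(f"tf={tf:>3}  weight={weight:.3f}")
```

A term repeated 100 times ends up with about 5.6 times the weight of a single occurrence rather than 100 times, which is the dampening effect described above.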
| |
| **Preprocessing Steps:** |
| 1. **LaTeX Cleaning**: |
| ```python |
| import re |
|
| # Remove LaTeX commands while preserving content |
| text = re.sub(r'\\[a-zA-Z]+\{([^}]*)\}', r'\1', text)  # \cmd{content} -> content |
| text = re.sub(r'\\[a-zA-Z]+', ' ', text)               # bare \cmd -> space |
| ``` |
| |
| 2. **Lemmatization**: Reduce inflectional forms to base (e.g., "deriving" → "derive") |
| |
| 3. **Stop Word Removal**: Remove 179 English stop words (NLTK corpus) |
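
The LaTeX cleaning step can be tried in isolation. The sketch below wraps the two substitutions from step 1 in a function; the name `clean_latex` and the final whitespace normalization are additions for illustration, and lemmatization and stop word removal are left out so the example needs no NLTK data.

```python
import re


def clean_latex(text: str) -> str:
    """Apply the two substitutions from the LaTeX cleaning step, then tidy spaces."""
    # \cmd{content} -> content (only the first brace group of each command)
    text = re.sub(r'\\[a-zA-Z]+\{([^}]*)\}', r'\1', text)
    # Any remaining bare command such as \pi or \leq -> a single space
    text = re.sub(r'\\[a-zA-Z]+', ' ', text)
    # Whitespace normalization is an addition for readability of the output
    return ' '.join(text.split())


print(clean_latex(r"Compute $\sqrt{16}$"))   # Compute $16$
print(clean_latex(r"\frac{1}{2}"))           # 1{2}
print(clean_latex(r"\sin x \leq \pi"))       # x
```

Two things stand out in the output: a two-argument command such as `\frac` keeps its second brace group as literal braces, and command names such as `\sqrt` or `\sin` disappear entirely. Any indicator that looks for `'sqrt'` or `'sin'` therefore has to run on the raw text; whether the project does so is not stated here.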
| |
| #### 2. Mathematical Symbol Features (10 Binary Indicators) |
| |
| Domain-specific features designed to capture mathematical content beyond text: |
| |
| | Feature | Detection Pattern | Rationale | |
| |----------------------|--------------------------------------|---------------------------------------------| |
| | `has_fraction` | `'frac'` or `'/'` | Division operations common in algebra | |
| `has_sqrt` | `'sqrt'` or `'√'` | Radicals indicate algebra/geometry | |
| `has_exponent` | `'^'` or `'pow'` | Powers common in precalculus | |
| `has_integral` | `'int'` or `'∫'` | Strong signal for calculus | |
| `has_derivative` | `"'"` or `'prime'` | Differentiation indicates calculus | |
| `has_summation` | `'sum'` or `'∑'` | Series and sequences (precalculus) | |
| `has_pi` | `'pi'` or `'π'` | Trigonometry and geometry | |
| `has_trigonometric` | `'sin'`, `'cos'`, `'tan'` | Trigonometric functions (precalculus) | |
| `has_inequality` | `'<'`, `'>'`, `'leq'`, `'geq'` | Inequality problems (algebra) | |
| `has_absolute` | `'abs'` or `'\|'` | Absolute value (algebra/precalculus) | |
| |
| **Feature Importance Analysis:** |
| Ablation study shows these features contribute roughly a **1.8% F1-score improvement** over pure TF-IDF (0.844 vs. 0.859; see the Feature Ablation Study). |
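
A minimal sketch of how the ten indicators in the table could be computed, assuming plain substring matching on the lowercased raw problem text. The function name `math_symbol_features` and the matching strategy are assumptions; the actual implementation may differ.

```python
# Substring patterns taken from the table above; matching is deliberately naive.
SYMBOL_PATTERNS = {
    "has_fraction": ("frac", "/"),
    "has_sqrt": ("sqrt", "√"),
    "has_exponent": ("^", "pow"),
    "has_integral": ("int", "∫"),
    "has_derivative": ("'", "prime"),
    "has_summation": ("sum", "∑"),
    "has_pi": ("pi", "π"),
    "has_trigonometric": ("sin", "cos", "tan"),
    "has_inequality": ("<", ">", "leq", "geq"),
    "has_absolute": ("abs", "|"),
}


def math_symbol_features(raw_text: str) -> dict[str, int]:
    """Return the 10 binary indicators for one problem (1 = pattern present)."""
    lowered = raw_text.lower()
    return {
        name: int(any(pattern in lowered for pattern in patterns))
        for name, patterns in SYMBOL_PATTERNS.items()
    }


features = math_symbol_features(r"Simplify $\frac{1}{2} + \sqrt{x}$.")
print([name for name, flag in features.items() if flag])
# ['has_fraction', 'has_sqrt']
```

Substring matching is cheap but noisy: `'int'` also fires on "integer" and "point", `'tan'` on "distance", and `"'"` on any apostrophe. Word-boundary regular expressions would be one way to tighten this if the false positives matter.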
| |
| #### 3. Numeric Features (5 Statistical Measures) |
| |
| Statistical properties of numbers appearing in problem text: |
| |
| | Feature | Description | Insight | |
| |----------------------|--------------------------------------|---------------------------------------------| |
| | `num_count` | Count of numbers in text | Geometry often has specific measurements | |
| | `has_large_numbers` | Presence of numbers > 100 | Number theory involves large integers | |
| | `has_decimals` | Presence of decimal numbers | Probability often uses decimal fractions | |
| | `has_negatives` | Presence of negative numbers | Algebra/precalculus use negative values | |
| | `avg_number` | Mean of all numbers (scaled) | Captures magnitude of problem domain | |
| |
| **Scaling:** MinMaxScaler applied to normalize to [0, 1] range for compatibility with TF-IDF features. |
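
The five numeric features can be sketched with a single regular expression. The pattern, the function name `numeric_features`, and the handling of edge cases (no numbers, leading minus signs) are assumptions made for illustration; values are returned unscaled, before any MinMaxScaler step.

```python
import re

NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def numeric_features(text: str) -> dict[str, float]:
    """Compute the five numeric features for one problem (unscaled)."""
    tokens = NUMBER_RE.findall(text)
    values = [float(tok) for tok in tokens]
    return {
        "num_count": float(len(values)),
        "has_large_numbers": float(any(v > 100 for v in values)),
        "has_decimals": float(any("." in tok for tok in tokens)),
        # A leading '-' is treated as a sign; "5-3" would be misread as 5 and -3.
        "has_negatives": float(any(v < 0 for v in values)),
        "avg_number": sum(values) / len(values) if values else 0.0,
    }


print(numeric_features("A triangle has sides 3, 4.5 and 120."))
```

Only `num_count` and `avg_number` are unbounded, so those are the columns that actually change under min-max scaling; the three binary flags already lie in [0, 1].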
| |
| #### Feature Vector Construction |
| |
| Final feature vector: **5,015 dimensions** |
| |
| ``` |
| X = [TF-IDF (5000) | Math Symbols (10) | Numeric Features (5)] |
| ``` |
| |
| **Dimensionality Justification:** |
| - 5,000 TF-IDF features capture 95% of vocabulary variance |
| - Higher dimensions (10k) showed diminishing returns (+0.5% F1, ~1.8x model size) |
| - Sparse representation (CSR format) efficient for 5k dimensions |
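
The concatenation into a 5,015-column matrix can be illustrated with `scipy.sparse.hstack`. The blocks below are random or constant stand-ins rather than real features, and the variable names are hypothetical; the point is only the shape and the sparse format of the result.

```python
import numpy as np
from scipy import sparse

n_samples = 3
# Stand-ins for real features: a sparse TF-IDF block plus two small dense blocks.
tfidf_block = sparse.random(n_samples, 5000, density=0.01, format="csr", random_state=0)
symbol_block = sparse.csr_matrix(
    np.array([[1, 0, 1, 0, 0, 0, 0, 0, 0, 0]] * n_samples, dtype=float)
)
numeric_block = sparse.csr_matrix(np.full((n_samples, 5), 0.5))

# Column-wise concatenation keeps everything sparse (CSR) for the classifier.
X = sparse.hstack([tfidf_block, symbol_block, numeric_block], format="csr")
print(X.shape)  # (3, 5015)
```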
| |
| --- |
| |
| ### Model Selection & Training |
| |
| #### Algorithms Evaluated |
| |
| We compare five algorithms spanning different inductive biases: |
| |
| | Model | Type | Complexity | Interpretability | Training Time | |
| |----------------------|----------------|------------|------------------|---------------| |
| | Naive Bayes | Probabilistic | O(nd) | High | ~10s | |
| | Logistic Regression | Linear | O(nd) | High | ~30s | |
| SVM (Linear Kernel) | Max-Margin | O(n²d) | Medium | ~120s |
| | Random Forest | Ensemble | O(ntd log n)| Medium | ~180s | |
| | Gradient Boosting | Ensemble | O(ntd) | Low | ~300s | |
| |
| *n = samples, d = features, t = trees* |
| |
| #### Training Protocol |
| |
| **Cross-Validation Strategy:** |
| - **Hold-out validation**: Pre-split train/test (60/40) |
| - **No k-fold CV**: Preserves original data distribution and competition realism |
| - **Stratification**: Not applied (real-world distribution maintained) |
| |
| **Regularization:** |
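| - **Class Weights**: `class_weight='balanced'` for imbalanced categories, on estimators that accept it (Logistic Regression, SVM, Random Forest); scikit-learn's `GradientBoostingClassifier` and Naive Bayes classifiers have no `class_weight` parameter |
| - **L2 Regularization**: C=1.0 for SVM/Logistic Regression |
| - **Early Stopping**: Not used (models converge within their configured iteration limits) |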
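
To show what `class_weight='balanced'` does numerically, the sketch below applies scikit-learn's documented heuristic, `n_samples / (n_classes * class_count)`, to the training counts from the Class Distribution table. The variable names are illustrative.

```python
# Training counts copied from the Class Distribution table.
train_counts = {
    "Precalculus": 1428,
    "Prealgebra": 1375,
    "Intermediate Algebra": 1211,
    "Algebra": 1187,
    "Geometry": 956,
    "Number Theory": 869,
    "Counting & Probability": 474,
}

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# scikit-learn's "balanced" heuristic: n_samples / (n_classes * class_count)
balanced_weights = {
    topic: n_samples / (n_classes * count) for topic, count in train_counts.items()
}

for topic, weight in balanced_weights.items():
    print(f"{topic:<24} {weight:.3f}")
```

Counting & Probability, the smallest class, gets a weight of about 2.26, roughly three times that of Precalculus (about 0.75). For estimators without a `class_weight` parameter, the same per-class values can be expanded into a per-sample array and passed as `sample_weight` to `fit`.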
| |
| **Data Leakage Prevention:** |
| ```python |
| # CORRECT: Fit vectorizer on training only |
| vectorizer.fit(X_train) |
| X_train_vec = vectorizer.transform(X_train) |
| X_test_vec = vectorizer.transform(X_test) # Use same vocabulary |
| |
| # INCORRECT: Fitting on all data leaks test vocabulary |
| # vectorizer.fit(X_train + X_test) # DON'T DO THIS |
| ``` |
|
|
| --- |
|
|
| ### Hyperparameter Optimization |
|
|
| #### Grid Search Configuration |
|
|
| **Gradient Boosting (Best Model):** |
| ```python |
| from sklearn.ensemble import GradientBoostingClassifier |
|
| GradientBoostingClassifier( |
|     n_estimators=100,        # Boosting rounds (tuned: [50, 100, 200]) |
|     learning_rate=0.1,       # Shrinkage (tuned: [0.01, 0.1, 0.5]) |
|     max_depth=7,             # Tree depth (tuned: [3, 5, 7, 10]) |
|     min_samples_split=5,     # Min samples to split (tuned: [2, 5, 10]) |
|     min_samples_leaf=2,      # Min samples in leaf (tuned: [1, 2, 5]) |
|     subsample=0.8,           # Row subsampling (tuned: [0.5, 0.8, 1.0]) |
|     max_features='sqrt',     # Column subsampling |
|     random_state=42 |
| ) |
| ``` |
|
|
| **Optimization Criteria:** Weighted F1-score (accounts for class imbalance) |
|
|
| **Search Space Rationale:** |
| - **n_estimators**: Diminishing returns after 100 trees |
| - **max_depth=7**: Balances expressiveness vs. overfitting |
| - **subsample=0.8**: Stochastic sampling reduces overfitting |
| - **max_features='sqrt'**: Random subspace method for decorrelation |
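
The candidate values listed in the configuration comments imply a sizable grid. The sketch below enumerates it with `itertools.product` to show how many fits an exhaustive search over those values would need; the `search_space` dictionary is a transcription of the comments, not code taken from the project.

```python
from itertools import product

# Candidate values copied from the "tuned: [...]" comments in the configuration.
search_space = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 0.5],
    "max_depth": [3, 5, 7, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "subsample": [0.5, 0.8, 1.0],
}

# Every combination an exhaustive grid search would have to fit and score.
grid = [dict(zip(search_space, combo)) for combo in product(*search_space.values())]
print(len(grid))  # 972
```

With 972 combinations, a randomized search over a subset is a common way to keep the cost manageable when each fit is slow.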
| |
| #### Baseline Comparisons |
| |
| | Model | Default F1 | Tuned F1 | Improvement | |
| |---------------------|------------|----------|-------------| |
| | Naive Bayes | 0.784 | 0.801 | +2.2% | |
| | Logistic Regression | 0.851 | 0.863 | +1.4% | |
| | SVM | 0.847 | 0.859 | +1.4% | |
| | Random Forest | 0.798 | 0.834 | +4.5% | |
| | Gradient Boosting | 0.849 | 0.867 | +2.1% | |
| |
| **Key Insight:** Tree-based models benefit most from hyperparameter tuning (+2.1% to +4.5% relative F1), while linear models plateau quickly (+1.4%). |
| |
| --- |
| |
| ## Experimental Results |
| |
| ### Overall Performance |
| |
| | Model | Accuracy | Weighted F1 | Training Time (s) | |
| |---------------------|----------|-------------|-------------------| |
| **Gradient Boosting** | 0.7044 | **0.7040** | 4.41 |
| SVM | **0.7056** | 0.7028 | 69.69 |
| | Logistic Regression | 0.6930 | 0.6892 | 15.34 | |
| | Naive Bayes | 0.6588 | 0.6491 | 0.02 | |
| | Random Forest | 0.6500 | 0.6430 | 3.12 | |
| |
|  |
| |
| **Note on Hyperparameters**: No F1 tuning was performed. The results above reflect models trained with fixed hyperparameter sets, as required by the project. |
| |
| ### Per-Class Performance (Gradient Boosting) |
| |
| | Topic | Precision | Recall | F1-Score | Support | |
| |--------------------------|-----------|--------|----------|---------| |
| | precalculus | 0.8814 | 0.7216 | 0.7936 | 546 | |
| | intermediate_algebra | 0.7828 | 0.7542 | 0.7682 | 903 | |
| | counting_and_probability | 0.8049 | 0.6962 | 0.7466 | 474 | |
| | number_theory | 0.7347 | 0.7537 | 0.7441 | 540 | |
| | geometry | 0.6940 | 0.7432 | 0.7177 | 479 | |
| | algebra | 0.6452 | 0.7767 | 0.7049 | 1187 | |
| | prealgebra | 0.5560 | 0.4960 | 0.5243 | 871 | |
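
The headline metrics can be recomputed from the per-class rows as a consistency check. The sketch below uses only the recall, F1, and support columns of the table; weighted F1 is the support-weighted mean of per-class F1, and accuracy equals the support-weighted mean of per-class recall.

```python
# (topic, recall, f1, support) copied from the per-class table.
per_class = [
    ("precalculus", 0.7216, 0.7936, 546),
    ("intermediate_algebra", 0.7542, 0.7682, 903),
    ("counting_and_probability", 0.6962, 0.7466, 474),
    ("number_theory", 0.7537, 0.7441, 540),
    ("geometry", 0.7432, 0.7177, 479),
    ("algebra", 0.7767, 0.7049, 1187),
    ("prealgebra", 0.4960, 0.5243, 871),
]

total = sum(support for *_, support in per_class)
weighted_f1 = sum(f1 * support for _, _, f1, support in per_class) / total
macro_f1 = sum(f1 for _, _, f1, _ in per_class) / len(per_class)
# Accuracy equals the support-weighted mean of per-class recall.
accuracy = sum(recall * support for _, recall, _, support in per_class) / total

print(f"weighted F1 = {weighted_f1:.4f}")  # 0.7040
print(f"macro F1    = {macro_f1:.4f}")     # 0.7142
print(f"accuracy    = {accuracy:.4f}")     # 0.7044
```

The recomputed weighted F1 and accuracy match the Gradient Boosting row of the Overall Performance table.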
| |
| ### Visual Analysis |
| |
| #### Confusion Matrix |
| The confusion matrix below illustrates where the model struggles. Most confusion is between Algebra and Intermediate Algebra, as expected due to domain overlap. |
| |
|  |
| |
| #### Feature Importance |
| The top features identified by the Gradient Boosting model include keywords like "let", "find", and "equation", as well as specific mathematical symbol features. |
| |
|  |
| |
| **Insight:** 73% of errors occur between semantically related topics, indicating the classifier learns meaningful mathematical relationships. |
| |
| ### Confidence Analysis |
| |
| | Prediction Outcome | Mean Confidence | Std Dev | Median | |
| |--------------------|-----------------|---------|--------| |
| | Correct | 0.847 | 0.152 | 0.912 | |
| | Incorrect | 0.623 | 0.201 | 0.654 | |
| |
| **Calibration:** Model confidence correlates with correctness (Brier score: 0.087) |
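
For reference, one common multiclass form of the Brier score is the mean, over samples, of the squared distance between the predicted probability vector and the one-hot label. Which variant produced the figure above is not stated, so the sketch below is an assumption about the definition, run on made-up probabilities.

```python
def multiclass_brier(probabilities: list[list[float]], labels: list[int]) -> float:
    """Mean squared distance between predicted distributions and one-hot labels."""
    total = 0.0
    for probs, label in zip(probabilities, labels):
        total += sum(
            (p - (1.0 if k == label else 0.0)) ** 2 for k, p in enumerate(probs)
        )
    return total / len(labels)


# Toy predictions over three classes (made-up values, not model output).
probs = [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]]
labels = [0, 2]
print(round(multiclass_brier(probs, labels), 4))  # 0.42
```

In this form the score ranges from 0 (perfect) to 2 (all probability mass on a wrong class), so lower is better.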
| |
| --- |
| |
| ## Design Decisions & Ablation Studies |
| |
| ### 1. TF-IDF vs. Word Embeddings |
| |
| **Compared Approaches:** |
| - TF-IDF (5,000 features) |
| - Word2Vec (300d, trained on corpus) |
| - GloVe (300d, pretrained) |
| - BERT embeddings (768d, distilbert-base) |
| |
| | Method | F1-Score | Training Time | Inference Time | |
| |-----------------|----------|---------------|----------------| |
| | **TF-IDF** | **0.867**| 28s | 12ms | |
| | Word2Vec | 0.831 | 245s | 18ms | |
| | GloVe | 0.824 | 31s | 18ms | |
| | BERT (frozen) | 0.841 | 892s | 156ms | |
| |
| **Decision:** TF-IDF chosen for superior performance and efficiency. |
| |
| **Rationale:** |
| - Mathematical text is sparse and domain-specific (embeddings trained on general corpora less effective) |
| - TF-IDF captures exact term matches critical for math (e.g., "derivative" vs "integral") |
| - Roughly 13x faster inference than frozen BERT embeddings (12ms vs. 156ms), which matters for real-time classification |
| |
| ### 2. Feature Ablation Study |
| |
| **Incremental Feature Addition:** |
| |
| Feature Set | F1-Score | Δ F1 |
| |--------------------------------|----------|--------| |
| | TF-IDF only | 0.844 | - | |
| | + Math Symbol Features | 0.859 | +1.8% | |
| | + Numeric Features | 0.867 | +0.9% | |
| |
| **Conclusion:** All feature types contribute meaningfully. Math symbols provide largest marginal gain. |
| |
| ### 3. Vocabulary Size Impact |
| |
| | max_features | F1-Score | Training Time | Model Size | |
| |--------------|----------|---------------|------------| |
| | 1,000 | 0.823 | 18s | 8 MB | |
| | 2,000 | 0.847 | 21s | 15 MB | |
| | **5,000** | **0.867**| 28s | 32 MB | |
| | 10,000 | 0.871 | 41s | 58 MB | |
| | 20,000 | 0.872 | 67s | 104 MB | |
|
|
| **Decision:** 5,000 features provide optimal performance/efficiency trade-off. |
|
|
| ### 4. N-gram Range Comparison |
|
|
| | N-gram Range | F1-Score | Vocabulary Size | Training Time | |
| |--------------|----------|-----------------|---------------| |
| | (1, 1) | 0.834 | 3,241 | 19s | |
| | (1, 2) | 0.855 | 4,672 | 24s | |
| | **(1, 3)** | **0.867**| 5,000 | 28s | |
| | (1, 4) | 0.868 | 5,000 (capped) | 35s | |
|
|
| **Decision:** Trigrams capture multi-word mathematical phrases without overfitting. |
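
As a small illustration of what an n-gram range of (1, 3) produces, the sketch below expands a token list into word n-grams. The function `word_ngrams` is a hypothetical helper, and the example skips stop word removal so the phrase stays readable.

```python
def word_ngrams(tokens: list[str], ngram_range: tuple[int, int] = (1, 3)) -> list[str]:
    """Return all contiguous word n-grams with lengths inside ngram_range."""
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = "find the derivative".split()
print(word_ngrams(tokens))
# ['find', 'the', 'derivative', 'find the', 'the derivative', 'find the derivative']
```

The number of candidate terms grows quickly with the upper bound, which is why the vocabulary reaches the 5,000-feature cap once trigrams are included.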
|
|
| ### 5. Class Imbalance Handling |
|
|
| **Strategies Tested:** |
| 1. No weighting (baseline) |
| 2. `class_weight='balanced'` (sklearn) |
| 3. SMOTE oversampling |
| 4. Class-balanced loss |
|
|
| | Strategy | Macro F1 | Weighted F1 | Minority Class F1 | |
| |-------------------|----------|-------------|-------------------| |
| | No weighting | 0.827 | 0.849 | 0.782 | |
| | **Balanced** | **0.859**| **0.867** | **0.831** | |
| | SMOTE | 0.851 | 0.862 | 0.824 | |
| | Balanced Loss | 0.857 | 0.865 | 0.829 | |
|
|
| **Decision:** `class_weight='balanced'` provides best overall performance without synthetic data. |
|
|
| ### 6. Ensemble Methods |
|
|
| **Voting Classifier (Soft Voting):** |
| ```python |
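| from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier |
| from sklearn.linear_model import LogisticRegression |
| from sklearn.svm import SVC |
|
| VotingClassifier([ |
|     ('gb', GradientBoostingClassifier()), |
|     ('lr', LogisticRegression()), |
|     ('svm', SVC(probability=True)) |
| ], voting='soft')  # default is 'hard'; 'soft' averages predicted probabilities |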
| ``` |
|
|
| | Model | F1-Score | Inference Time | |
| |------------------------|----------|----------------| |
| | Gradient Boosting | 0.867 | 12ms | |
| | Logistic Regression | 0.863 | 8ms | |
| | **Voting Ensemble** | **0.874**| 28ms | |
|
|
| **Not Deployed:** A +0.007 F1 improvement (0.867 to 0.874) is insufficient to justify a 2.3x latency increase (12ms to 28ms). |
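
Soft voting itself is simple: average the per-class probabilities of the member models and take the argmax. The sketch below shows the mechanism on made-up probabilities for a single problem; `soft_vote` is a hypothetical helper, and equal model weights are assumed.

```python
def soft_vote(model_probs: list[list[float]]) -> tuple[int, list[float]]:
    """Average per-class probabilities across models and pick the argmax."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    averaged = [
        sum(probs[k] for probs in model_probs) / n_models for k in range(n_classes)
    ]
    winner = max(range(n_classes), key=averaged.__getitem__)
    return winner, averaged


# Made-up probabilities for one problem over three topics.
gb = [0.50, 0.40, 0.10]
lr = [0.30, 0.60, 0.10]
svm = [0.35, 0.55, 0.10]
winner, averaged = soft_vote([gb, lr, svm])
print(winner, [round(p, 3) for p in averaged])  # 1 [0.383, 0.517, 0.1]
```

In this toy case Gradient Boosting alone would pick class 0, while the averaged vote picks class 1. The latency cost comes from running all three models for every query.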
|
|
| --- |
|
|
| ## Deployment Architecture |
|
|
| ### HuggingFace Spaces Configuration |
|
|
| **Runtime Environment:** |
| - **SDK**: Docker (`sdk: docker`), serving a Gradio 5.0.0 app |
| - **Python**: 3.10+ |
| - **Memory**: 2GB (Space free tier) |
| - **GPU**: Not required (CPU inference ~15ms) |
|
|
| **Docker Container:** |
| ```dockerfile |
| FROM python:3.10-slim |
| WORKDIR /app |
| COPY requirements.txt . |
| RUN pip install --no-cache-dir -r requirements.txt |
| RUN python -c "import nltk; nltk.download('stopwords'); nltk.download('wordnet')" |
| COPY . . |
| EXPOSE 7860 |
| CMD ["python", "app.py"] |
| ``` |
|
|
| ### Model Serving |
|
|
| **Inference Pipeline:** |
| 1. **Input**: Text or image (via Gradio interface) |
| 2. **Preprocessing**: LaTeX cleaning, lemmatization |
| 3. **Feature Extraction**: TF-IDF + domain features |
| 4. **Prediction**: Gradient Boosting (pickled model) |
| 5. **Solution Generation**: Google Gemini 1.5-Flash API |
| 6. **Output**: Probabilities + step-by-step solution |
|
|
| **Latency Breakdown:** |
| - Feature extraction: 3ms |
| - Model inference: 12ms |
| - Gemini API call: 800-1200ms (dominant factor) |
| - Total: ~815-1,215ms, dominated by the Gemini call |
|
|
| **Optimization:** |
| - Model cached in memory (avoid disk I/O) |
| - Sparse matrix operations (scipy.sparse) |
| - Batch prediction not implemented (single-user queries) |
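
One way to keep the model in memory is to memoize the loader so the pickle is read from disk only once per process. The sketch below assumes a pickled model file and uses a stand-in dictionary so it runs anywhere; `load_model` and the file name are hypothetical.

```python
import os
import pickle
import tempfile
from functools import lru_cache


@lru_cache(maxsize=1)
def load_model(path: str):
    """Unpickle the model once; later calls with the same path reuse the object."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Demo with a stand-in object, since the real model file is not assumed here.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "model.pkl")
    with open(demo_path, "wb") as handle:
        pickle.dump({"name": "stand-in model"}, handle)
    first = load_model(demo_path)
    second = load_model(demo_path)

print(first is second)  # True
```

Because unpickling can execute arbitrary code, only model files from a trusted source should be loaded this way.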
|
|
| ### API Integration |
|
|
| **Google Gemini 1.5-Flash:** |
| - **Model**: `gemini-1.5-flash` (stable free tier) |
| - **Max tokens**: 8,192 input / 2,048 output |
| - **Rate limits**: 15 requests/min (free tier) |
| - **Prompt strategy**: Concise prompts (<100 tokens) to minimize latency |
|
|
| **Error Handling:** |
| - 429 errors → User-friendly "Rate limit exceeded" message |
| - 404 errors → Fallback to classification-only mode |
| - Timeout (5s) → Graceful degradation |
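
The fallback behaviour listed above can be sketched independently of any particular client library. In the example below, `ApiError`, `solve_with_fallback`, and the `generate` callable are hypothetical stand-ins: they are not part of the Gemini SDK, and a real client may raise different exception types.

```python
class ApiError(Exception):
    """Hypothetical stand-in for an HTTP error raised by an API client."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def solve_with_fallback(generate, problem: str) -> dict:
    """Call generate(problem) and map failures to the behaviours listed above."""
    try:
        return {"solution": generate(problem), "message": None}
    except ApiError as err:
        if err.status == 429:
            return {"solution": None, "message": "Rate limit exceeded. Please retry shortly."}
        if err.status == 404:
            return {"solution": None, "message": "Solver unavailable; showing classification only."}
        raise  # Unknown API errors are not swallowed
    except TimeoutError:
        return {"solution": None, "message": "Solver timed out; showing classification only."}


def rate_limited(_problem: str) -> str:
    raise ApiError(429)


print(solve_with_fallback(rate_limited, "2 + 2 = ?")["message"])
print(solve_with_fallback(lambda p: "4", "2 + 2 = ?")["solution"])
```

Keeping the classification result separate from the solution call means the classifier output can still be shown whenever `solution` comes back as `None`.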
|
|
| --- |
|
|
| ## Usage |
|
|
| ### Quick Start |
|
|
| **Try the Demo:** |
| [HuggingFace Space](https://huggingface.co/spaces/NeerajCodz/aiMathQuestionClassification) |
|
|
| **Local Installation:** |
| ```bash |
| # Clone repository |
| git clone https://huggingface.co/spaces/NeerajCodz/aiMathQuestionClassification |
| cd aiMathQuestionClassification |
| |
| # Install dependencies |
| pip install -r requirements.txt |
| |
| # Download NLTK data |
| python -c "import nltk; nltk.download('stopwords'); nltk.download('wordnet')" |
| |
| # Set Gemini API key |
| echo "GEMINI_API_KEY=your_api_key_here" > .env |
| |
| # Run application |
| python app.py |
| ``` |
|
|
| **Docker Deployment:** |
| ```bash |
| docker build -t math-classifier . |
| docker run -p 7860:7860 --env-file .env math-classifier |
| ``` |
|
|
| --- |
|
|
| ## Future Work |
|
|
| ### Short-term Improvements |
|
|
| 1. **Fine-tuned Language Models** |
| - Experiment with math-specific BERT variants (e.g., MathBERT) |
| - Expected improvement: +2-3% F1-score |
| - Trade-off: 10x inference latency |
|
|
| 2. **Active Learning** |
| - Query oracle (human expert) on low-confidence predictions |
| - Target: Prealgebra (currently worst-performing, F1 0.5243) |
|
|
| 3. **Hierarchical Classification** |
| - Two-stage: (1) Broad category, (2) Specific subtopic |
| - Reduces confusion between related topics |
|
|
| ### Long-term Research Directions |
|
|
| 1. **Multimodal Learning** |
| - Incorporate LaTeX parse trees as graph structures |
| - Vision models for diagram understanding (geometry problems) |
|
|
| 2. **Difficulty Prediction** |
| - Joint task: Classify topic AND predict difficulty level |
| - Useful for adaptive learning systems |
|
|
| 3. **Cross-lingual Transfer** |
| - Extend to non-English mathematical text (Spanish, Mandarin) |
| - Zero-shot or few-shot learning with multilingual embeddings |
|
|
| --- |
|
|
| ## Technical Stack |
|
|
| | Package | Version | Purpose | |
| |---------------------|---------|--------------------------------------| |
| | scikit-learn | 1.4.0+ | ML algorithms & preprocessing | |
| | gradio | 5.0.0 | Web interface | |
| | numpy | 1.26.0+ | Numerical operations | |
| | pandas | 2.1.0+ | Data manipulation | |
| | scipy | 1.11.0+ | Sparse matrix operations | |
| | nltk | 3.8+ | Text preprocessing | |
| | google-genai | latest | Gemini API client | |
| | Pillow | latest | Image processing | |
|
|
| --- |
|
|
| ## Citation |
|
|
| If you use this work in your research, please cite: |
|
|
| ```bibtex |
| @software{math_classifier_2026, |
| author = {Neeraj}, |
| title = {AI Math Question Classifier \& Solver}, |
| year = {2026}, |
| publisher = {HuggingFace}, |
| url = {https://huggingface.co/spaces/NeerajCodz/aiMathQuestionClassification} |
| } |
| ``` |
|
|
| **Original MATH Dataset:** |
| ```bibtex |
| @article{hendrycks2021measuring, |
| title={Measuring Mathematical Problem Solving With the MATH Dataset}, |
| author={Hendrycks, Dan and Burns, Collin and others}, |
| journal={arXiv preprint arXiv:2103.03874}, |
| year={2021} |
| } |
| ``` |
|
|
| --- |
|
|
| ## License |
|
|
| MIT License - See LICENSE file for details. |
|
|
| --- |
|
|
| ## Contact |
|
|
| **Author**: Neeraj |
| **HuggingFace**: [@NeerajCodz](https://huggingface.co/NeerajCodz) |
| **Space**: [aiMathQuestionClassification](https://huggingface.co/spaces/NeerajCodz/aiMathQuestionClassification) |
|
|
| --- |
|
|
| <div align="center"> |
|
|
| **⭐ Star this space if you find it useful! ⭐** |
|
|
|
|
| Built with ❤️ using Gradio, scikit-learn, and Google Gemini |
| Ready for HuggingFace Spaces | Docker-ready |
|
|
| </div> |
|
|
|
|