Everything you need to know before uploading your first dataset to Wood Wide AI. Wood Wide AI works with tabular data: rows and columns, like a spreadsheet or database export. This guide walks you through getting your data ready so training and inference go smoothly on the first try.
Documentation Index
Fetch the complete documentation index at: https://docs.woodwide.ai/llms.txt
Use this file to discover all available pages before exploring further.
Supported Formats
CSV
The most common format that works with Excel, Google Sheets, and every database export tool.
Parquet
Preferred for large datasets: smaller file size and faster uploads.
Data Structure Checklist
Before uploading, walk through this checklist. If you can check every box, your data is ready.
1. One row per record
| Right | Wrong |
|---|---|
| One row per customer per month | Multiple months comma-separated in one cell |
| One row per transaction | Summary rows mixed in with detail rows |
| One row per device reading | Blank rows used as section dividers |
2. One column per attribute
| Right | Wrong |
|---|---|
| Separate `city` and `state` columns | `San Francisco, CA` in one `location` column |
| Separate `first_name` and `last_name` | `John Smith` in one `name` column |
| `revenue` as a standalone number | `$1,234.56` with currency symbols and commas |
3. A header row with clear column names
- Use `snake_case` or plain words: `monthly_revenue`, `signup_date`, `customer_segment`
- Avoid special characters, leading/trailing spaces, or duplicate column names
- Don’t leave any column name blank
4. No formatting, formulas, or merged cells
- Currency symbols (`$`, `EUR`) — keep the number only
- Percentage signs (`%`) — use `0.15` instead of `15%`
- Commas in numbers — `1234567` not `1,234,567`
- Excel formulas — copy-paste as values first
- Merged cells — unmerge and fill each cell
- Summary/total rows at the bottom
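If you’re cleaning in pandas, a minimal sketch of these fixes (file and column names are illustrative):

```python
import pandas as pd

df = pd.read_csv("raw.csv")  # hypothetical export

# Currency symbols and thousands separators: keep the number only
df["revenue"] = (
    df["revenue"]
    .astype(str)
    .str.replace(r"[$€,]", "", regex=True)
    .astype(float)
)

# Percentage signs: "15%" becomes 0.15
# (assumes every value in the column is written with a % sign)
df["discount"] = df["discount"].astype(str).str.rstrip("%").astype(float) / 100

df.to_csv("clean.csv", index=False)
```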
5. Consistent values within each column
| Column | Consistent | Inconsistent |
|---|---|---|
| `status` | `active`, `churned`, `trial` | `Active`, `ACTIVE`, `active`, `1` |
| `date` | `2025-03-15` throughout | Mix of `3/15/25`, `March 15, 2025`, `2025-03-15` |
| `revenue` | `14500.00` throughout | Mix of `$14,500`, `14500`, `14.5K` |
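One way to enforce consistency locally with pandas (a sketch; `format="mixed"` assumes pandas ≥ 2.0):

```python
import pandas as pd

df = pd.read_csv("raw.csv")  # hypothetical export

# One canonical spelling per category: Active, ACTIVE, active -> active
df["status"] = df["status"].str.strip().str.lower()

# One date format throughout: parse mixed formats, re-emit as YYYY-MM-DD
df["date"] = pd.to_datetime(df["date"], format="mixed").dt.strftime("%Y-%m-%d")
```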
Column Types
Wood Wide AI automatically infers the data type of each column when you upload a dataset. You don’t need to declare types manually. Here’s what gets detected:
| Inferred Type | What It Looks Like | Examples |
|---|---|---|
| Numeric | Numbers (integers or decimals) | 42, 3.14, -100, 0.001 |
| Categorical | Text labels or codes with repeated values | enterprise, smb, US, tier_1 |
| Datetime | Dates and timestamps | 2025-03-15, 2025-03-15T09:30:00Z |
| Binary | Two-value columns | 0/1, true/false, yes/no |
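Wood Wide AI’s detector runs server-side, but you can sanity-check the same idea locally before uploading (the inference below is pandas’, not Wood Wide AI’s):

```python
import pandas as pd
from io import StringIO

# Illustrative rows: one datetime, one numeric, one categorical, one binary column
sample = StringIO(
    "signup_date,monthly_revenue,customer_segment,is_active\n"
    "2025-03-15,1499.00,enterprise,true\n"
    "2025-03-16,89.00,smb,false\n"
)
df = pd.read_csv(sample, parse_dates=["signup_date"])
print(df.dtypes)  # datetime64[ns], float64, object, bool
```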
Handling Common Issues
Missing Values
Missing data is normal. Leave cells empty. Don’t fill them with placeholders like `N/A`, `null`, `none`, `0`, or `-1`, which will be treated as real values.
| Do this ✔️ | Not this 🚫 |
|---|---|
| Leave the cell empty | Fill with N/A or null |
| Leave the cell empty | Fill with 0 (unless 0 is a real value) |
| Leave the cell empty | Fill with -999 or any sentinel value |
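If a file already contains placeholders, you can convert them to true missing values on read (a sketch; extend `placeholders` to whatever your data actually uses):

```python
import pandas as pd

# Strings to treat as missing; only list 0 or -1 here if they are never real values
placeholders = ["N/A", "null", "none", "-999"]
df = pd.read_csv("raw.csv", na_values=placeholders)

# Truly missing values are written back out as empty cells
df.to_csv("clean.csv", index=False)
```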
Duplicate Rows
Check for and remove exact duplicate rows before uploading. Duplicates can skew model training; the model will over-weight those patterns.
ID Columns
Columns like `row_id`, `customer_id`, or `transaction_id` are unique per row and don’t carry predictive signal. You can leave them in (they won’t hurt), but for cleaner results, consider removing pure ID columns before upload.
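Both fixes are one-liners in pandas (column names are illustrative):

```python
import pandas as pd

df = pd.read_csv("raw.csv")

# Drop exact duplicate rows so the model doesn't over-weight them
df = df.drop_duplicates()

# Drop pure ID columns that are unique per row and carry no signal
df = df.drop(columns=["row_id", "customer_id"], errors="ignore")
```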
High-Cardinality Text
Columns where almost every value is unique (like free-text notes, email addresses, or URLs) don’t work well as features. They look categorical but have no repeating patterns for the model to learn from. Remove them, or replace them with something structured.
| Remove or transform | Keep |
|---|---|
| `customer_email` (unique per row) | `email_domain` (repeating: `gmail.com`, `company.com`) |
| `free_text_notes` | `note_length` or `has_notes` (true/false) |
| `full_address` | `city`, `state`, `zip` as separate columns |
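A sketch of the transforms in the table above (column names are illustrative):

```python
import pandas as pd

df = pd.read_csv("raw.csv")

# Keep the repeating part of an email, drop the unique-per-row original
df["email_domain"] = df["customer_email"].str.split("@").str[-1]

# Replace free text with structured signals
df["note_length"] = df["free_text_notes"].fillna("").str.len()
df["has_notes"] = df["free_text_notes"].notna()

df = df.drop(columns=["customer_email", "free_text_notes"])
```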
Preparing Data by Task
Different model types work best with different data shapes. Here’s what to keep in mind for each.
Prediction
You need a target column → the thing you want to predict. This is specified as `label_column` when training.
- Classification (categorical target): the target column should contain discrete categories like `churned`/`retained`, `high`/`medium`/`low`, or `approved`/`denied`.
- Regression (numeric target): the target column should contain continuous numbers like revenue, score, or duration.
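For example, deriving a classification target before upload (the 90-day churn rule is a hypothetical business definition, not a Wood Wide AI requirement):

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical export

# Hypothetical rule: no login in 90 days counts as churned
df["churn_status"] = (df["days_since_last_login"] > 90).map(
    {True: "churned", False: "retained"}
)
# "churn_status" is the column you'd pass as label_column when training
```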
Clustering
No target column needed. Include the columns that describe the attributes you want to group by. If you want behavioral clusters, include behavioral columns (usage frequency, spend patterns, engagement metrics). If you want firmographic clusters, include firmographic columns (industry, size, region).
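In practice that can be as simple as uploading only the relevant columns (names are illustrative):

```python
import pandas as pd

df = pd.read_csv("customers.csv")

# Behavioral clustering: keep only behavioral attributes
behavioral = ["usage_frequency", "monthly_spend", "engagement_score"]
df[behavioral].to_csv("clustering_upload.csv", index=False)
```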
Anomaly Detection
No target column needed. Include columns that represent “normal” behavior. The model learns what normal looks like and flags rows that deviate. More columns describing typical patterns = better anomaly detection.
Factor Analysis
No target column needed. Include all columns you suspect might share underlying patterns. Factor analysis discovers the hidden structure that explains why your columns move together.
Pre-Upload Quick Check
Run through this before every upload:
- One row per record, one column per attribute
- Clear header row, no blank or duplicate column names
- No formatting, formulas, or merged cells
- Consistent values and consistent types within each column
- Missing cells left empty, no placeholder values
- No exact duplicate rows
What Happens After Upload
Once your data is uploaded, Wood Wide AI:
- Infers the schema: detects column names and types (`numeric`, `categorical`, `datetime`, `binary`) automatically.
- Versions your dataset: every upload creates a new version, so you can always go back.
- Handles inference alignment: when you run inference later, the system automatically aligns your new data to the training schema. Extra columns are dropped, missing columns are filled with nulls, and type mismatches are coerced where possible.
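If you want to preview what alignment will do before running inference, here is a local sketch of the same idea (it mirrors the description above; it is not Wood Wide AI’s implementation):

```python
import pandas as pd

def align_to_schema(df: pd.DataFrame, training_columns: list[str]) -> pd.DataFrame:
    extra = [c for c in df.columns if c not in training_columns]
    df = df.drop(columns=extra)         # extra columns are dropped
    for col in training_columns:
        if col not in df.columns:
            df[col] = pd.NA             # missing columns are filled with nulls
    return df[training_columns]         # training-schema column order
```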
AI-Assisted Cleanup Prompts
If you’re not sure how to fix a data issue, paste your CSV into any LLM (like ChatGPT or Claude) along with a prompt targeting one of these cleanup tasks:
- General health check
- Standardize categorical values
- Clean numeric columns
- Standardize dates
- Remove placeholder missing values
- Remove duplicates and junk rows
- Reduce high-cardinality columns
- Prepare a prediction target column
- Full cleanup → do everything
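As an illustration (this wording is ours, not an official prompt), a general health check prompt could read:

```text
Here is my CSV. Check it against this list and report every issue, column by
column: one row per record, one column per attribute, clear header names,
numeric columns free of currency symbols/percent signs/commas, one consistent
date format (YYYY-MM-DD), consistent category labels, empty cells (not N/A,
null, or -999) for missing values, and no duplicate or summary rows. Then
output the corrected CSV.
```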
Exporting from Common Tools
Google Sheets
File → Download → Comma-separated values (.csv).
Microsoft Excel
File → Save As → CSV UTF-8 (Comma delimited) (.csv). Copy-paste formulas as values and unmerge any merged cells first.
Salesforce
Export reports as CSV. Currency fields may export with `$` or locale-specific symbols, and picklist fields may contain semicolon-delimited multi-values. Clean these before upload.
SQL / Database Export
Make sure `NULL` values export as empty strings, not the literal text `NULL`.
Python (pandas)
When exporting with `df.to_csv()`, pass `index=False` to avoid adding an extra index column.
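A minimal export sketch covering both supported formats (file names are illustrative; Parquet export needs `pyarrow` or `fastparquet` installed):

```python
import pandas as pd

df = pd.read_csv("raw_export.csv")

# CSV: index=False keeps pandas' row index out of the file
df.to_csv("upload.csv", index=False)

# Parquet: smaller files and faster uploads for large datasets
df.to_parquet("upload.parquet", index=False)
```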
Dataset FAQs
How many rows do I need?
How many columns can I have?
Can I include dates?
Yes. Use `YYYY-MM-DD` or `YYYY-MM-DDTHH:MM:SSZ` for best results.
What if my inference data has different columns than my training data?
The system aligns inference data to the training schema automatically: extra columns are dropped, missing columns are filled with nulls, and type mismatches are coerced where possible.
Can I update my dataset later?
Yes. Every upload creates a new dataset version, so you can always go back to an earlier one.
What if I have more than 30 MB of data?
Consider exporting to Parquet instead of CSV; it produces smaller files and faster uploads (see Supported Formats above).