

Everything you need to know before uploading your first dataset to Wood Wide AI. Wood Wide AI works with tabular data: rows and columns, like a spreadsheet or database export. This guide walks you through getting your data ready so training and inference go smoothly on the first try.

Supported Formats

CSV

Comma-separated values.
The most common format that works with Excel, Google Sheets, and every database export tool.

Parquet

Columnar storage format.
Preferred for large datasets: smaller files and faster uploads.
Direct uploads are limited to 30 MB. For larger files, Wood Wide AI provides a signed-URL upload flow that handles files of any size. See Large File Uploads for details.
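
The exact endpoints are documented in the Large File Uploads guide, but the general shape of a signed-URL flow is: request an upload URL from the API, then send the file bytes to that URL with an HTTP PUT. The sketch below is a rough Python illustration only; the endpoint path, auth header, and response field names are placeholders, not the real Wood Wide AI API.

```python
import requests

# Rough sketch of a signed-URL upload. The endpoint path, request body, and
# "upload_url" field are hypothetical placeholders -- see Large File Uploads
# for the actual API. The pattern (request a URL, then PUT the bytes) is the
# part that generalizes.
API_KEY = "your-api-key"

resp = requests.post(
    "https://api.woodwide.ai/v1/uploads",            # placeholder endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"filename": "my_data.parquet"},
)
signed_url = resp.json()["upload_url"]               # placeholder field name

with open("my_data.parquet", "rb") as f:
    requests.put(signed_url, data=f)                 # not subject to the 30 MB limit
```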

Data Structure Checklist

Before uploading, walk through this checklist. If you can check every box, your data is ready.
Each row should represent one observation, transaction, customer, event, or time period. Don’t nest multiple records into a single row or use merged cells.
| Right | Wrong |
| --- | --- |
| One row per customer per month | Multiple months comma-separated in one cell |
| One row per transaction | Summary rows mixed in with detail rows |
| One row per device reading | Blank rows used as section dividers |
Each column should contain one type of information. Don’t combine multiple values into a single column.
| Right | Wrong |
| --- | --- |
| Separate city and state columns | San Francisco, CA in one location column |
| Separate first_name and last_name | John Smith in one name column |
| revenue as a standalone number | $1,234.56 with currency symbols and commas |
The first row must contain column names. Keep them short, descriptive, and consistent. Tips:
  • Use snake_case or plain words: monthly_revenue, signup_date, customer_segment
  • Avoid special characters, leading/trailing spaces, or duplicate column names
  • Don’t leave any column name blank
If you’re exporting from a spreadsheet, strip everything back to raw values before saving as CSV. Remove:
  • Currency symbols ($, EUR) — keep the number only
  • Percentage signs (%) — use 0.15 instead of 15%
  • Commas in numbers — 1234567 not 1,234,567
  • Excel formulas — copy-paste as values first
  • Merged cells — unmerge and fill each cell
  • Summary/total rows at the bottom
Every value in a column should follow the same format. Mixed formats confuse schema inference.
| Column | Consistent | Inconsistent |
| --- | --- | --- |
| status | active, churned, trial | Active, ACTIVE, active, 1 |
| date | 2025-03-15 throughout | Mix of 3/15/25, March 15, 2025, 2025-03-15 |
| revenue | 14500.00 throughout | Mix of $14,500, 14500, 14.5K |
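
If you are comfortable with Python, a few lines of pandas handle most of the formatting cleanup above. The column names below (revenue, discount, signup_date, status) are only examples; adapt them to your own data.

```python
import pandas as pd

df = pd.read_csv("raw_export.csv")

# Strip currency symbols and thousands separators, then convert to numbers
df["revenue"] = pd.to_numeric(
    df["revenue"].astype(str).str.replace(r"[$€£,\s]", "", regex=True),
    errors="coerce",
)

# Convert "15%"-style values to fractions (0.15); assumes the column is expressed in percent
df["discount"] = (
    pd.to_numeric(df["discount"].astype(str).str.rstrip("%"), errors="coerce") / 100
)

# Standardize dates to ISO format (YYYY-MM-DD)
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce").dt.strftime("%Y-%m-%d")

# Make categorical values consistent: trim whitespace, use one casing
df["status"] = df["status"].str.strip().str.lower()

df.to_csv("clean_export.csv", index=False)
```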

Column Types

Wood Wide AI automatically infers the data type of each column when you upload a dataset. You don’t need to declare types manually. Here’s what gets detected:
| Inferred Type | What It Looks Like | Examples |
| --- | --- | --- |
| Numeric | Numbers (integers or decimals) | 42, 3.14, -100, 0.001 |
| Categorical | Text labels or codes with repeated values | enterprise, smb, US, tier_1 |
| Datetime | Dates and timestamps | 2025-03-15, 2025-03-15T09:30:00Z |
| Binary | Two-value columns | 0/1, true/false, yes/no |
When in doubt, keep it simple. If a column has numbers, make sure every value is actually a number (no text mixed in). If it’s a category, make sure the same category is always spelled the same way.
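
You can get a rough preview of how types will be read by checking what pandas infers locally. This is only an approximation of Wood Wide AI's own inference, but an object dtype where you expect numbers is a reliable sign that text is mixed in. Column names below are examples.

```python
import pandas as pd

df = pd.read_csv("my_data.csv")

# An 'object' dtype on a column you expect to be numeric or datetime usually
# means text (e.g. "$1,234" or "N/A") is mixed in.
print(df.dtypes)

# Show the values that fail numeric conversion in a column you expect to be numeric
is_bad = pd.to_numeric(df["revenue"], errors="coerce").isna() & df["revenue"].notna()
print(df.loc[is_bad, "revenue"].unique())
```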

Handling Common Issues

Missing Values

Missing data is normal. Leave cells empty or use blank values. Don’t fill them with placeholders like N/A, null, none, 0, or -1, which will be treated as real values.
| Do this ✔️ | Not this 🚫 |
| --- | --- |
| Leave the cell empty | Fill with N/A or null |
| Leave the cell empty | Fill with 0 (unless 0 is a real value) |
| Leave the cell empty | Fill with -999 or any sentinel value |
If a column has mostly missing values (more than 80-90% blank), consider removing it entirely. A column with very little data won’t contribute much to model quality.
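
In pandas, a pass like the sketch below converts common placeholders to real blanks and drops columns that are mostly empty. The placeholder list and the 85% threshold are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("my_data.csv")

# Convert common placeholder strings to real missing values (written as blanks in the CSV)
placeholders = ["N/A", "n/a", "NA", "null", "NULL", "none", "None", "-", "--", "-999"]
df = df.replace(placeholders, np.nan)

# Drop columns that are more than ~85% empty (tune the threshold to your data)
sparse = df.columns[df.isna().mean() > 0.85]
df = df.drop(columns=sparse)

df.to_csv("my_data_clean.csv", index=False)
```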

Duplicate Rows

Check for and remove exact duplicate rows before uploading. Duplicates can skew model training; the model will over-weight those patterns.
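
With pandas, this is a single call; a minimal sketch:

```python
import pandas as pd

df = pd.read_csv("my_data.csv")
before = len(df)
df = df.drop_duplicates()   # keeps the first occurrence of each exact duplicate
print(f"Removed {before - len(df)} duplicate rows")
df.to_csv("my_data_dedup.csv", index=False)
```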

ID Columns

Columns like row_id, customer_id, or transaction_id are unique per row and don’t carry predictive signal. You can leave them in (they won’t hurt) but for cleaner results, consider removing pure ID columns before upload.

High-Cardinality Text

Columns where almost every value is unique (like free-text notes, email addresses, or URLs) don’t work well as features. They look categorical but have no repeating patterns for the model to learn from. Remove or replace them with something structured.
| Remove or transform | Keep |
| --- | --- |
| customer_email (unique per row) | email_domain (repeating: gmail.com, company.com) |
| free_text_notes | note_length or has_notes (true/false) |
| full_address | city, state, zip as separate columns |
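
A quick way to spot ID-like and high-cardinality columns, and to pull a structured feature out of one, is sketched below. The 90% threshold and the customer_email / row_id column names are illustrative.

```python
import pandas as pd

df = pd.read_csv("my_data.csv")

# Flag text columns where almost every value is unique -- likely IDs, emails,
# URLs, or free text with no repeating patterns for a model to learn from.
for col in df.select_dtypes(include="object").columns:
    ratio = df[col].nunique() / len(df)
    if ratio > 0.9:
        print(f"{col}: {ratio:.0%} unique -- consider removing or transforming")

# Example transformation: keep the repeating part of an email address,
# then drop the raw email and any pure ID columns.
df["email_domain"] = df["customer_email"].str.split("@").str[-1]
df = df.drop(columns=["customer_email", "row_id"])
```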

Preparing Data by Task

Different model types work best with different data shapes. Here’s what to keep in mind for each.

Prediction

You need a target column: the thing you want to predict. This is specified as label_column when training.
  • Classification (categorical target): The target column should contain discrete categories like churned/retained, high/medium/low, or approved/denied.
  • Regression (numeric target): The target column should contain continuous numbers like revenue, score, or duration.
Wood Wide AI auto-detects whether it’s classification or regression based on the target column values.
Make sure your target column is clean and well-defined. If you’re predicting churn, every row should have a clear churned or not_churned value — not a mix of blanks, maybes, and partial labels.
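
A quick look at the target column's distribution catches blanks and inconsistent labels before training; a minimal pandas sketch, assuming an example churned label column:

```python
import pandas as pd

df = pd.read_csv("my_data.csv")

# For a classification target: every row should have one clean, consistent label.
# value_counts with dropna=False surfaces blanks, typos, and stray casing.
print(df["churned"].value_counts(dropna=False))

# Rows with no label can't serve as training examples for this target
print(f"{df['churned'].isna().sum()} rows are missing a target value")
```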

Clustering

No target column needed. Include the columns that describe the attributes you want to group by. If you want behavioral clusters, include behavioral columns (usage frequency, spend patterns, engagement metrics). If you want firmographic clusters, include firmographic columns (industry, size, region).

Anomaly Detection

No target column needed. Include columns that represent “normal” behavior. The model learns what normal looks like and flags rows that deviate. The more columns you include describing typical patterns, the better the anomaly detection.

Factor Analysis

No target column needed. Include all columns you suspect might share underlying patterns. Factor analysis discovers the hidden structure that explains why your columns move together.

Pre-Upload Quick Check

Run through this before every upload:
1. Format: file is .csv or .parquet, under 30 MB (or use signed-URL upload for larger files).
2. Header row: first row contains column names. No blanks, no duplicates.
3. No formatting: no currency symbols, percentage signs, commas in numbers, formulas, or merged cells.
4. Consistent types: each column uses one data type throughout. Numbers are numbers. Categories are spelled consistently.
5. Missing values: blanks are truly blank, not filled with N/A, null, or 0 as placeholders.
6. No junk rows: no summary rows, total rows, or blank separator rows. Just data.
7. Target column (prediction only): if training a prediction model, your target column exists, is clean, and has clear values.
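
If you'd rather automate this pass, a small pandas script along these lines covers most of the checklist. The thresholds and placeholder list are illustrative, and duplicate header names are best checked in the raw file since pandas renames them on read.

```python
import os
import pandas as pd

PATH = "my_data.csv"
df = pd.read_csv(PATH)

# 1. Format: direct uploads are limited to 30 MB
size_mb = os.path.getsize(PATH) / 1_000_000
if size_mb > 30:
    print(f"{size_mb:.1f} MB -- use the signed-URL upload flow")

# 2. Header row: no blank or unnamed columns
# (pandas renames duplicate headers to "name.1" on read, so also check the raw file)
blank_names = [c for c in df.columns if str(c).strip() == "" or str(c).startswith("Unnamed")]
print(f"Blank/unnamed columns: {blank_names}")

# 3-4. Formatting and mixed types: object columns that are mostly numeric need cleanup
for col in df.select_dtypes(include="object").columns:
    if pd.to_numeric(df[col], errors="coerce").notna().mean() > 0.8:
        print(f"{col}: mostly numeric but stored as text -- check for $, %, commas")

# 5. Missing-value placeholders
placeholders = {"N/A", "n/a", "null", "NULL", "none", "-1", "-999"}
for col in df.columns:
    hits = df[col].astype(str).isin(placeholders).sum()
    if hits:
        print(f"{col}: {hits} placeholder values -- replace with blank cells")

# 6. Junk rows: fully blank rows and exact duplicates
print(f"Blank rows: {df.isna().all(axis=1).sum()}, duplicate rows: {df.duplicated().sum()}")
```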

What Happens After Upload

Once your data is uploaded, Wood Wide AI:
  1. Infers the schema: detects column names and types (numeric, categorical, datetime, binary) automatically.
  2. Versions your dataset: every upload creates a new version, so you can always go back.
  3. Handles inference alignment: when you run inference later, the system automatically aligns your new data to the training schema. Extra columns are dropped, missing columns are filled with nulls, and type mismatches are coerced where possible.
You don’t need to manually match your inference data to your training data. The platform handles it for you.
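
Conceptually, the alignment step behaves like a reindex against the training schema. The sketch below only illustrates the described behavior with pandas; it is not the platform's actual implementation.

```python
import pandas as pd

training_columns = ["monthly_revenue", "signup_date", "customer_segment"]

# New inference data: extra column, one column missing, different order
new_data = pd.DataFrame({
    "customer_segment": ["smb", "enterprise"],
    "monthly_revenue": [1200.0, 98000.0],
    "extra_column": ["dropped", "dropped"],
})

# Extra columns are dropped, missing columns are filled with nulls,
# and column order matches the training schema.
aligned = new_data.reindex(columns=training_columns)
print(aligned)
```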

AI-Assisted Cleanup Prompts

If you’re not sure how to fix a data issue, paste your CSV into any LLM (like ChatGPT or Claude) along with one of these prompts. Each one targets a specific cleanup task.

Prompt 1: General data review
I'm preparing this CSV to upload to a machine learning platform that accepts
tabular data (CSV or Parquet). The platform auto-detects column types as
numeric, categorical, datetime, or binary.

Please review my data and flag:
- Columns with mixed data types (e.g., numbers and text in the same column)
- Inconsistent categorical values (e.g., "Active", "active", "ACTIVE")
- Placeholder values used for missing data (e.g., "N/A", "null", "none", "-1")
- Columns that are likely pure IDs with no predictive value
- Currency symbols, percentage signs, or commas inside numeric values
- Summary or total rows mixed in with data rows
- Any other issues that could cause problems during model training

For each issue, tell me exactly what to fix and how.

Prompt 2: Standardize categorical values
Look at the categorical (text) columns in this CSV. For each one:
1. List all unique values
2. Flag inconsistencies (different capitalization, trailing spaces, typos,
   abbreviations vs. full names)
3. Suggest a standardized version of each value
4. Output a cleaned version of the CSV with all categorical values standardized

Prompt 3: Clean numeric columns
Check all columns in this CSV that should be numeric. For each one:
1. Remove currency symbols ($, €, £), percentage signs (%), and commas
2. Convert text like "1.5K" or "2M" to actual numbers (1500, 2000000)
3. Flag any values that can't be converted to a number
4. Replace non-numeric values with blank (empty) cells, not "N/A" or "0"
5. Output the cleaned CSV

Prompt 4: Standardize dates
Find all date or timestamp columns in this CSV. For each one:
1. List the different date formats you see (e.g., "3/15/25", "March 15, 2025",
   "2025-03-15")
2. Convert all dates to ISO 8601 format: YYYY-MM-DD (e.g., 2025-03-15)
3. Flag any values that look like dates but are ambiguous (e.g., "01/02/2025"
   could be Jan 2 or Feb 1)
4. Output the cleaned CSV

Prompt 5: Replace missing-value placeholders
Scan every column in this CSV for placeholder values that represent missing data.
Common placeholders include: "N/A", "n/a", "NA", "null", "NULL", "none", "None",
"missing", "-", "--", "-1", "-999", "0" (when used as "unknown"), "TBD", "unknown",
"not available", "not applicable", "#N/A", "#REF!", "#VALUE!".

Replace all of these with truly empty cells (blank, no value).
Do NOT replace "0" if it appears to be a legitimate numeric value in that column.
List every replacement you made. 
Output the cleaned CSV.

Prompt 6: Remove duplicate and junk rows
Check this CSV for:
1. Exact duplicate rows -- list them and remove all but the first occurrence
2. Summary or total rows (rows that aggregate other rows, often at the bottom)
3. Blank separator rows or rows with no data
4. Header rows repeated in the middle of the data

Remove all junk rows. Tell me how many rows were removed and why.
Output the cleaned CSV.

Prompt 7: Flag high-cardinality columns
For each text column in this CSV, count the number of unique values relative
to the total number of rows.

Flag columns where almost every value is unique (>90% unique) -- these are
likely IDs, emails, URLs, or free-text fields that won't help a model learn
patterns.

For each flagged column, suggest one of:
- Remove it entirely (if it's a pure ID or free-text field)
- Extract a useful feature from it (e.g., email → domain, URL → domain,
  full name → nothing useful, address → city/state/zip)

Apply your suggestions and output the cleaned CSV.

Prompt 8: Check the target column
I want to train a prediction model on this CSV. My target column (the thing
I want to predict) is: [COLUMN NAME]

Please check this column and:
1. List all unique values and their counts
2. Flag any issues: missing values, inconsistent labels, ambiguous categories
3. Tell me whether this looks like a classification task (discrete categories)
   or regression task (continuous numbers)
4. If classification: suggest standardized label names if the current ones are
   inconsistent
5. If regression: flag any non-numeric values that need to be cleaned
6. Flag if the target is heavily imbalanced (e.g., 95% one class, 5% another)
7. Output the CSV with the target column cleaned

Prompt 9: Full cleanup
I need to prepare this CSV for upload to a machine learning platform. The
platform accepts CSV or Parquet, auto-detects column types (numeric,
categorical, datetime, binary), and has a 30 MB file size limit.

Please perform a full cleanup:
1. Remove exact duplicate rows
2. Remove summary/total rows and blank separator rows
3. Standardize all categorical values (consistent casing, no trailing spaces)
4. Clean numeric columns (remove $, %, commas; convert "1.5K" → 1500)
5. Standardize all dates to YYYY-MM-DD format
6. Replace placeholder missing values (N/A, null, none, etc.) with blank cells
7. Flag high-cardinality columns (>90% unique) and suggest removal or transformation
8. Flag columns that are likely pure IDs with no predictive value

For each change, briefly explain what you did and why.
Output the final cleaned CSV.

Exporting from Common Tools

Google Sheets: File > Download > Comma-separated values (.csv). Before exporting: remove any filter views, unhide all rows/columns, and check that no cells contain formulas that haven’t been evaluated.

Excel: File > Save As > CSV UTF-8 (Comma delimited). Before exporting: select “Paste as Values” on any formula cells, unmerge all cells, and remove any summary rows or pivot tables from the data sheet.

Salesforce: Reports > Export > CSV. Check for Salesforce-specific formatting: currency fields may include $ or locale-specific symbols, and picklist fields may contain semicolon-delimited multi-values. Clean these before upload.

Databases (SQL clients): Export query results as CSV. Most database clients (DBeaver, pgAdmin, DataGrip) have a direct “Export to CSV” option. Make sure NULL values export as empty strings, not the literal text NULL.

Python (pandas): df.to_csv("my_data.csv", index=False)
Use index=False to avoid adding an extra index column.
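
If you are exporting from pandas and the file is large, writing Parquet (the preferred format mentioned above) is one extra call, assuming pyarrow or fastparquet is installed:

```python
# Requires a Parquet engine: pip install pyarrow
df.to_parquet("my_data.parquet", index=False)
```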

Dataset FAQs

How many rows do I need?
There’s no strict minimum, but more data generally means better models. A few hundred rows can work for simple patterns. For complex prediction tasks, a few thousand rows or more will produce stronger results.

How many columns can I include?
There’s no hard limit. Include the columns that are relevant to what you’re trying to learn or predict. Irrelevant columns add noise but won’t break anything.

Can I include date columns?
Yes. Datetime columns are detected automatically. Use a standard format like YYYY-MM-DD or YYYY-MM-DDTHH:MM:SSZ for best results.

What if my inference data doesn’t match my training data?
The platform handles this automatically. Extra columns in your inference data are dropped. Missing columns are filled with nulls. Type mismatches are coerced where possible. You don’t need to manually align the two files.

Can I update a dataset after uploading it?
Yes. Uploading a new file to an existing dataset creates a new version. Previous versions are preserved.

What if my file is larger than 30 MB?
Use the signed-URL upload flow for files larger than 30 MB.