

Everything you need to know before uploading your first dataset to Wood Wide AI. Wood Wide AI works with tabular data: rows and columns, like a spreadsheet or database export. This guide walks you through getting your data ready so training and inference go smoothly on the first try.

Supported Formats

CSV

Comma-separated values.
The most common format that works with Excel, Google Sheets, and every database export tool.

Parquet

Columnar storage format.
Preferred for large datasets: smaller files and faster uploads.
Direct uploads are limited to 30 MB. For larger files, Wood Wide AI provides a signed-URL upload flow that handles files of any size. See Large File Uploads for details.
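
The exact endpoints are documented in the Large File Uploads guide, but the general shape of a signed-URL flow is: request an upload URL from the API, then send the file bytes to that URL with an HTTP PUT. The sketch below is a rough Python illustration only; the endpoint path, auth header, and response field names are placeholders, not the real Wood Wide AI API.

```python
import requests

# Rough sketch of a signed-URL upload. The endpoint path, request body, and
# "upload_url" field are hypothetical placeholders -- see Large File Uploads
# for the actual API. The pattern (request a URL, then PUT the bytes) is the
# part that generalizes.
API_KEY = "your-api-key"

resp = requests.post(
    "https://api.woodwide.ai/v1/uploads",            # placeholder endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"filename": "my_data.parquet"},
)
signed_url = resp.json()["upload_url"]               # placeholder field name

with open("my_data.parquet", "rb") as f:
    requests.put(signed_url, data=f)                 # not subject to the 30 MB limit
```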

Data Structure Checklist

Before uploading, walk through this checklist. If you can check every box, your data is ready.
Each row should represent one observation, transaction, customer, event, or time period. Don’t nest multiple records into a single row or use merged cells.
| Right | Wrong |
| --- | --- |
| One row per customer per month | Multiple months comma-separated in one cell |
| One row per transaction | Summary rows mixed in with detail rows |
| One row per device reading | Blank rows used as section dividers |
Each column should contain one type of information. Don’t combine multiple values into a single column.
| Right | Wrong |
| --- | --- |
| Separate city and state columns | San Francisco, CA in one location column |
| Separate first_name and last_name | John Smith in one name column |
| revenue as a standalone number | $1,234.56 with currency symbols and commas |
The first row must contain column names. Keep them short, descriptive, and consistent. Tips:
  • Use snake_case or plain words: monthly_revenue, signup_date, customer_segment
  • Avoid special characters, leading/trailing spaces, or duplicate column names
  • Don’t leave any column name blank
If you’re exporting from a spreadsheet, strip everything back to raw values before saving as CSV. Remove:
  • Currency symbols ($, EUR) — keep the number only
  • Percentage signs (%) — use 0.15 instead of 15%
  • Commas in numbers — 1234567 not 1,234,567
  • Excel formulas — copy-paste as values first
  • Merged cells — unmerge and fill each cell
  • Summary/total rows at the bottom
Every value in a column should follow the same format. Mixed formats confuse schema inference.
| Column | Consistent | Inconsistent |
| --- | --- | --- |
| status | active, churned, trial | Active, ACTIVE, active, 1 |
| date | 2025-03-15 throughout | Mix of 3/15/25, March 15, 2025, 2025-03-15 |
| revenue | 14500.00 throughout | Mix of $14,500, 14500, 14.5K |
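
If you are comfortable with Python, a few lines of pandas handle most of the formatting cleanup above. The column names below (revenue, discount, signup_date, status) are only examples; adapt them to your own data.

```python
import pandas as pd

df = pd.read_csv("raw_export.csv")

# Strip currency symbols and thousands separators, then convert to numbers
df["revenue"] = pd.to_numeric(
    df["revenue"].astype(str).str.replace(r"[$€£,\s]", "", regex=True),
    errors="coerce",
)

# Convert "15%"-style values to fractions (0.15); assumes the column is expressed in percent
df["discount"] = (
    pd.to_numeric(df["discount"].astype(str).str.rstrip("%"), errors="coerce") / 100
)

# Standardize dates to ISO format (YYYY-MM-DD)
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce").dt.strftime("%Y-%m-%d")

# Make categorical values consistent: trim whitespace, use one casing
df["status"] = df["status"].str.strip().str.lower()

df.to_csv("clean_export.csv", index=False)
```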

Column Types

Wood Wide AI automatically infers the data type of each column when you upload a dataset. You don’t need to declare types manually. Here’s what gets detected:
| Inferred Type | What It Looks Like | Examples |
| --- | --- | --- |
| Numeric | Numbers (integers or decimals) | 42, 3.14, -100, 0.001 |
| Categorical | Text labels or codes with repeated values | enterprise, smb, US, tier_1 |
| Datetime | Dates and timestamps | 2025-03-15, 2025-03-15T09:30:00Z |
| Binary | Two-value columns | 0/1, true/false, yes/no |
When in doubt, keep it simple. If a column has numbers, make sure every value is actually a number (no text mixed in). If it’s a category, make sure the same category is always spelled the same way.
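
You can get a rough preview of how types will be read by checking what pandas infers locally. This is only an approximation of Wood Wide AI's own inference, but an object dtype where you expect numbers is a reliable sign that text is mixed in. Column names below are examples.

```python
import pandas as pd

df = pd.read_csv("my_data.csv")

# An 'object' dtype on a column you expect to be numeric or datetime usually
# means text (e.g. "$1,234" or "N/A") is mixed in.
print(df.dtypes)

# Show the values that fail numeric conversion in a column you expect to be numeric
is_bad = pd.to_numeric(df["revenue"], errors="coerce").isna() & df["revenue"].notna()
print(df.loc[is_bad, "revenue"].unique())
```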

Handling Common Issues

Missing Values

Missing data is normal. Leave cells empty or use blank values. Don’t fill them with placeholders like N/A, null, none, 0, or -1, which will be treated as real values.
| Do this ✔️ | Not this 🚫 |
| --- | --- |
| Leave the cell empty | Fill with N/A or null |
| Leave the cell empty | Fill with 0 (unless 0 is a real value) |
| Leave the cell empty | Fill with -999 or any sentinel value |
If a column has mostly missing values (more than 80-90% blank), consider removing it entirely. A column with very little data won’t contribute much to model quality.
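
In pandas, a pass like the sketch below converts common placeholders to real blanks and drops columns that are mostly empty. The placeholder list and the 85% threshold are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("my_data.csv")

# Convert common placeholder strings to real missing values (written as blanks in the CSV)
placeholders = ["N/A", "n/a", "NA", "null", "NULL", "none", "None", "-", "--", "-999"]
df = df.replace(placeholders, np.nan)

# Drop columns that are more than ~85% empty (tune the threshold to your data)
sparse = df.columns[df.isna().mean() > 0.85]
df = df.drop(columns=sparse)

df.to_csv("my_data_clean.csv", index=False)
```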

Duplicate Rows

Check for and remove exact duplicate rows before uploading. Duplicates can skew model training; the model will over-weight those patterns.
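
With pandas, this is a single call; a minimal sketch:

```python
import pandas as pd

df = pd.read_csv("my_data.csv")
before = len(df)
df = df.drop_duplicates()   # keeps the first occurrence of each exact duplicate
print(f"Removed {before - len(df)} duplicate rows")
df.to_csv("my_data_dedup.csv", index=False)
```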

ID Columns

Columns like row_id, customer_id, or transaction_id are unique per row and don’t carry predictive signal. You can leave them in (they won’t hurt) but for cleaner results, consider removing pure ID columns before upload.

High-Cardinality Text

Columns where almost every value is unique (like free-text notes, email addresses, or URLs) don’t work well as features. They look categorical but have no repeating patterns for the model to learn from. Remove or replace them with something structured.
| Remove or transform | Keep |
| --- | --- |
| customer_email (unique per row) | email_domain (repeating: gmail.com, company.com) |
| free_text_notes | note_length or has_notes (true/false) |
| full_address | city, state, zip as separate columns |
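
A quick way to spot ID-like and high-cardinality columns, and to pull a structured feature out of one, is sketched below. The 90% threshold and the customer_email / row_id column names are illustrative.

```python
import pandas as pd

df = pd.read_csv("my_data.csv")

# Flag text columns where almost every value is unique -- likely IDs, emails,
# URLs, or free text with no repeating patterns for a model to learn from.
for col in df.select_dtypes(include="object").columns:
    ratio = df[col].nunique() / len(df)
    if ratio > 0.9:
        print(f"{col}: {ratio:.0%} unique -- consider removing or transforming")

# Example transformation: keep the repeating part of an email address,
# then drop the raw email and any pure ID columns.
df["email_domain"] = df["customer_email"].str.split("@").str[-1]
df = df.drop(columns=["customer_email", "row_id"])
```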

Preparing Data by Task

Different model types work best with different data shapes. Here’s what to keep in mind for each.

Prediction

You need a target column: the thing you want to predict. This is specified as label_column when training.
  • Classification (categorical target): The target column should contain discrete categories like churned/retained, high/medium/low, or approved/denied.
  • Regression (numeric target): The target column should contain continuous numbers like revenue, score, or duration.
Wood Wide AI auto-detects whether it’s classification or regression based on the target column values.
Make sure your target column is clean and well-defined. If you’re predicting churn, every row should have a clear churned or not_churned value — not a mix of blanks, maybes, and partial labels.
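
A quick look at the target column's distribution catches blanks and inconsistent labels before training; a minimal pandas sketch, assuming an example churned label column:

```python
import pandas as pd

df = pd.read_csv("my_data.csv")

# For a classification target: every row should have one clean, consistent label.
# value_counts with dropna=False surfaces blanks, typos, and stray casing.
print(df["churned"].value_counts(dropna=False))

# Rows with no label can't serve as training examples for this target
print(f"{df['churned'].isna().sum()} rows are missing a target value")
```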

Clustering

No target column needed. Include the columns that describe the attributes you want to group by. If you want behavioral clusters, include behavioral columns (usage frequency, spend patterns, engagement metrics). If you want firmographic clusters, include firmographic columns (industry, size, region).

Anomaly Detection

No target column needed. Include columns that represent “normal” behavior. The model learns what normal looks like and flags rows that deviate. The more columns you include describing typical patterns, the better the anomaly detection.

Factor Analysis

No target column needed. Include all columns you suspect might share underlying patterns. Factor analysis discovers the hidden structure that explains why your columns move together.

Pre-Upload Quick Check

Run through this before every upload:
1. Format: file is .csv or .parquet, under 30 MB (or use signed-URL upload for larger files).
2. Header row: first row contains column names. No blanks, no duplicates.
3. No formatting: no currency symbols, percentage signs, commas in numbers, formulas, or merged cells.
4. Consistent types: each column uses one data type throughout. Numbers are numbers. Categories are spelled consistently.
5. Missing values: blanks are truly blank, not filled with N/A, null, or 0 as placeholders.
6. No junk rows: no summary rows, total rows, or blank separator rows. Just data.
7. Target column (prediction only): if training a prediction model, your target column exists, is clean, and has clear values.
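
If you'd rather automate this pass, a small pandas script along these lines covers most of the checklist. The thresholds and placeholder list are illustrative, and duplicate header names are best checked in the raw file since pandas renames them on read.

```python
import os
import pandas as pd

PATH = "my_data.csv"
df = pd.read_csv(PATH)

# 1. Format: direct uploads are limited to 30 MB
size_mb = os.path.getsize(PATH) / 1_000_000
if size_mb > 30:
    print(f"{size_mb:.1f} MB -- use the signed-URL upload flow")

# 2. Header row: no blank or unnamed columns
# (pandas renames duplicate headers to "name.1" on read, so also check the raw file)
blank_names = [c for c in df.columns if str(c).strip() == "" or str(c).startswith("Unnamed")]
print(f"Blank/unnamed columns: {blank_names}")

# 3-4. Formatting and mixed types: object columns that are mostly numeric need cleanup
for col in df.select_dtypes(include="object").columns:
    if pd.to_numeric(df[col], errors="coerce").notna().mean() > 0.8:
        print(f"{col}: mostly numeric but stored as text -- check for $, %, commas")

# 5. Missing-value placeholders
placeholders = {"N/A", "n/a", "null", "NULL", "none", "-1", "-999"}
for col in df.columns:
    hits = df[col].astype(str).isin(placeholders).sum()
    if hits:
        print(f"{col}: {hits} placeholder values -- replace with blank cells")

# 6. Junk rows: fully blank rows and exact duplicates
print(f"Blank rows: {df.isna().all(axis=1).sum()}, duplicate rows: {df.duplicated().sum()}")
```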

What Happens After Upload

Once your data is uploaded, Wood Wide AI:
  1. Infers the schema: detects column names and types (numeric, categorical, datetime, binary) automatically.
  2. Versions your dataset: every upload creates a new version, so you can always go back.
  3. Handles inference alignment: when you run inference later, the system automatically aligns your new data to the training schema. Extra columns are dropped, missing columns are filled with nulls, and type mismatches are coerced where possible.
You don’t need to manually match your inference data to your training data. The platform handles it for you.
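
Conceptually, the alignment step behaves like a reindex against the training schema. The sketch below only illustrates the described behavior with pandas; it is not the platform's actual implementation.

```python
import pandas as pd

training_columns = ["monthly_revenue", "signup_date", "customer_segment"]

# New inference data: extra column, one column missing, different order
new_data = pd.DataFrame({
    "customer_segment": ["smb", "enterprise"],
    "monthly_revenue": [1200.0, 98000.0],
    "extra_column": ["dropped", "dropped"],
})

# Extra columns are dropped, missing columns are filled with nulls,
# and column order matches the training schema.
aligned = new_data.reindex(columns=training_columns)
print(aligned)
```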

AI-Assisted Cleanup Prompts

If you’re not sure how to fix a data issue, paste your CSV into any LLM (like ChatGPT or Claude) along with one of these prompts. Each one targets a specific cleanup task.

Prompt 1: General data review
I'm preparing this CSV to upload to a machine learning platform that accepts
tabular data (CSV or Parquet). The platform auto-detects column types as
numeric, categorical, datetime, or binary.

Please review my data and flag:
- Columns with mixed data types (e.g., numbers and text in the same column)
- Inconsistent categorical values (e.g., "Active", "active", "ACTIVE")
- Placeholder values used for missing data (e.g., "N/A", "null", "none", "-1")
- Columns that are likely pure IDs with no predictive value
- Currency symbols, percentage signs, or commas inside numeric values
- Summary or total rows mixed in with data rows
- Any other issues that could cause problems during model training

For each issue, tell me exactly what to fix and how.

Prompt 2: Standardize categorical values
Look at the categorical (text) columns in this CSV. For each one:
1. List all unique values
2. Flag inconsistencies (different capitalization, trailing spaces, typos,
   abbreviations vs. full names)
3. Suggest a standardized version of each value
4. Output a cleaned version of the CSV with all categorical values standardized

Prompt 3: Clean numeric columns
Check all columns in this CSV that should be numeric. For each one:
1. Remove currency symbols ($, €, £), percentage signs (%), and commas
2. Convert text like "1.5K" or "2M" to actual numbers (1500, 2000000)
3. Flag any values that can't be converted to a number
4. Replace non-numeric values with blank (empty) cells, not "N/A" or "0"
5. Output the cleaned CSV

Prompt 4: Standardize dates
Find all date or timestamp columns in this CSV. For each one:
1. List the different date formats you see (e.g., "3/15/25", "March 15, 2025",
   "2025-03-15")
2. Convert all dates to ISO 8601 format: YYYY-MM-DD (e.g., 2025-03-15)
3. Flag any values that look like dates but are ambiguous (e.g., "01/02/2025"
   could be Jan 2 or Feb 1)
4. Output the cleaned CSV

Prompt 5: Replace missing-value placeholders
Scan every column in this CSV for placeholder values that represent missing data.
Common placeholders include: "N/A", "n/a", "NA", "null", "NULL", "none", "None",
"missing", "-", "--", "-1", "-999", "0" (when used as "unknown"), "TBD", "unknown",
"not available", "not applicable", "#N/A", "#REF!", "#VALUE!".

Replace all of these with truly empty cells (blank, no value).
Do NOT replace "0" if it appears to be a legitimate numeric value in that column.
List every replacement you made. 
Output the cleaned CSV.

Prompt 6: Remove duplicate and junk rows
Check this CSV for:
1. Exact duplicate rows -- list them and remove all but the first occurrence
2. Summary or total rows (rows that aggregate other rows, often at the bottom)
3. Blank separator rows or rows with no data
4. Header rows repeated in the middle of the data

Remove all junk rows. Tell me how many rows were removed and why.
Output the cleaned CSV.

Prompt 7: Flag high-cardinality columns
For each text column in this CSV, count the number of unique values relative
to the total number of rows.

Flag columns where almost every value is unique (>90% unique) -- these are
likely IDs, emails, URLs, or free-text fields that won't help a model learn
patterns.

For each flagged column, suggest one of:
- Remove it entirely (if it's a pure ID or free-text field)
- Extract a useful feature from it (e.g., email → domain, URL → domain,
  full name → nothing useful, address → city/state/zip)

Apply your suggestions and output the cleaned CSV.

Prompt 8: Check the target column
I want to train a prediction model on this CSV. My target column (the thing
I want to predict) is: [COLUMN NAME]

Please check this column and:
1. List all unique values and their counts
2. Flag any issues: missing values, inconsistent labels, ambiguous categories
3. Tell me whether this looks like a classification task (discrete categories)
   or regression task (continuous numbers)
4. If classification: suggest standardized label names if the current ones are
   inconsistent
5. If regression: flag any non-numeric values that need to be cleaned
6. Flag if the target is heavily imbalanced (e.g., 95% one class, 5% another)
7. Output the CSV with the target column cleaned

Prompt 9: Full cleanup
I need to prepare this CSV for upload to a machine learning platform. The
platform accepts CSV or Parquet, auto-detects column types (numeric,
categorical, datetime, binary), and has a 30 MB file size limit.

Please perform a full cleanup:
1. Remove exact duplicate rows
2. Remove summary/total rows and blank separator rows
3. Standardize all categorical values (consistent casing, no trailing spaces)
4. Clean numeric columns (remove $, %, commas; convert "1.5K" → 1500)
5. Standardize all dates to YYYY-MM-DD format
6. Replace placeholder missing values (N/A, null, none, etc.) with blank cells
7. Flag high-cardinality columns (>90% unique) and suggest removal or transformation
8. Flag columns that are likely pure IDs with no predictive value

For each change, briefly explain what you did and why.
Output the final cleaned CSV.

Exporting from Common Tools

Google Sheets: File > Download > Comma-separated values (.csv). Before exporting: remove any filter views, unhide all rows/columns, and check that no cells contain formulas that haven’t been evaluated.

Excel: File > Save As > CSV UTF-8 (Comma delimited). Before exporting: select “Paste as Values” on any formula cells, unmerge all cells, and remove any summary rows or pivot tables from the data sheet.

Salesforce: Reports > Export > CSV. Check for Salesforce-specific formatting: currency fields may include $ or locale-specific symbols, and picklist fields may contain semicolon-delimited multi-values. Clean these before upload.

Databases (SQL clients): Export query results as CSV. Most database clients (DBeaver, pgAdmin, DataGrip) have a direct “Export to CSV” option. Make sure NULL values export as empty strings, not the literal text NULL.

Python (pandas): df.to_csv("my_data.csv", index=False)
Use index=False to avoid adding an extra index column.
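
If you are exporting from pandas and the file is large, writing Parquet (the preferred format mentioned above) is one extra call, assuming pyarrow or fastparquet is installed:

```python
# Requires a Parquet engine: pip install pyarrow
df.to_parquet("my_data.parquet", index=False)
```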

Dataset FAQs

How many rows do I need?
There’s no strict minimum, but more data generally means better models. A few hundred rows can work for simple patterns. For complex prediction tasks, a few thousand rows or more will produce stronger results.

How many columns can I include?
There’s no hard limit. Include the columns that are relevant to what you’re trying to learn or predict. Irrelevant columns add noise but won’t break anything.

Can I include date columns?
Yes. Datetime columns are detected automatically. Use a standard format like YYYY-MM-DD or YYYY-MM-DDTHH:MM:SSZ for best results.

What if my inference data doesn’t match my training data?
The platform handles this automatically. Extra columns in your inference data are dropped. Missing columns are filled with nulls. Type mismatches are coerced where possible. You don’t need to manually align the two files.

Can I update a dataset after uploading it?
Yes. Uploading a new file to an existing dataset creates a new version. Previous versions are preserved.

What if my file is larger than 30 MB?
Use the signed-URL upload flow for files larger than 30 MB.