Bad Data Doesn't Happen Magically: A Guide to Finding and Fixing Your Data's Points of Failure

Jun 3

Things that happen magically:

Successful parallel parking on the first try
Airplane WiFi that works
Finding an avocado that’s ripe on the day you need it

Things that don’t happen magically:

Bad data

I work with a lot of organizations that throw their hands up when they talk about the state of their data. "Ugh - it's a mess." "We've always had data problems." "That's just how it is."

Bad data is often what stands between organizations and the value their data should be providing. Deeper insights, meaningful evaluation, automated reporting, and better decisions depend on data you can trust.

Bad data doesn't magically happen. In fact, there are really just a few common points of failure. These apply across any type of data your organization collects or uses: surveys and assessments, staff and volunteer observations, administrative records, and secondary data from external sources. Once you figure out where yours went awry, you can start taking steps to fix it.

In this article I'll walk through these common points of failure and what you can do about them:

Failure Point #1: Bad Data at the Source - The issue starts at the point of collection.
Failure Point #2: Multiple Conflicting Sources - You have more than one version of the truth, and you're not sure which one to trust.
Failure Point #3: Data Degrading After Collection - The data was fine when you collected it, but then something happened afterward.
Failure Point #4: Data Systems & Structures - Your data exists, but it's organized in a way that makes it hard to connect, compare, or use.

Failure Point #1: Bad Data at the Source | The issue starts at the point of collection.

If your data is flawed at the point of collection, those issues travel downstream into every analysis, report, and decision that relies on that data.

What this looks like:

Your data doesn't fully reflect the population or situation you're trying to understand

Data is only collected for program enrollees when you need to understand everyone impacted by an issue
A public dataset covers all residents, but you need to understand a specific subgroup, like renters or a particular neighborhood
Long-term outcome data is only available for a limited number of participants, making it hard to draw conclusions about lasting impact

Your questions and response options aren't set up for success

This failure point is most relevant to surveys, interviews, and assessments where data is collected directly from participants or community members. Answering a survey question well is actually a three-step process: 1) respondents need to understand what you're asking, 2) they should be able to come up with an accurate answer, and 3) they need to map that answer to the response options provided. Breakdowns at any step produce bad data.

Understandability: Questions need to be clear enough that every respondent interprets them the same way. Be on the lookout for:
- Vague or ambiguous questions: "How often do you engage with our organization?" Engage how? Attending events, reading emails, visiting the website? Without a clear definition, every respondent is answering a slightly different question.
- Double-barreled questions: "Have you heard of or used our program?" bundles two concepts into one, driving inconsistent responses and making it impossible to understand awareness vs. usage separately.
- Undefined concepts: "Did the participant’s health improve?" will mean something different to every person answering it.
Recall: Respondents can only answer accurately based on what they can reliably remember. "How many times have you purchased gas in the last 6 months?" is much harder to answer accurately than the same question asked about the last week. Consider the time frame and level of specificity you’re asking for. When recall might be an issue, using ranges ("1-3 times" vs. an exact number) can produce more reliable data.
Mapping to responses: Response options need to be clear and cover the full range of possible answers. Overlapping or non-exhaustive options make it difficult to record data accurately.
- When everyone selects "Other": Structured response options only work if they reflect the actual range of answers people have. If your options don't fit reality, you'll end up with a free text "Other" field that's just as hard to analyze as if you'd not provided options at all.
- Mismatched scales: Response scales need to match the nature of the question. A 1-10 rating for a question that's really just yes or no forces respondents to make an arbitrary choice. A complex scale where a simple agree/disagree would work better adds unnecessary cognitive load and produces less reliable data.

Your data collection tools don't enforce consistency

When there are no guardrails on how data is entered, the same value can come in a dozen different ways, and what seems like a small problem becomes a big one at scale.

Unformatted numeric or date fields: The same dollar amount enters as $10k, 10,000, and $10000. A short text field for dates allows entries like 8-10, 8/10/2025, August 10, 2025, or even 8-40-25. Inconsistent formats make data more error-prone and nearly impossible to analyze reliably.
Unrealistic or erroneous values slip through undetected: Without validation rules, obviously wrong values get recorded alongside legitimate ones. This could be a donation amount of $1,000,000 when your average gift is $50, a program start date 20 years in the future, or a negative attendance count. These errors can significantly skew your analysis if not caught and addressed.
Open text where structured responses would work better: Free text fields generate a different answer every time, while a dropdown or multiple choice option captures data more consistently. Information captured as free text often requires significant manual cleanup and processing before the data is usable.
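To make the formatting problem concrete, here is a minimal Python sketch of the cleanup that an unformatted currency field pushes downstream. The function name `parse_amount` and the sample entries are my own illustrations, and the rules it handles (dollar signs, commas, a trailing "k") are assumptions about what might show up, not a complete list.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_amount(raw: str) -> Optional[Decimal]:
    """Turn entries like '$10k', '10,000', or '$10000' into one numeric value.

    Returns None when the entry can't be read as a number, so it can be
    flagged for review instead of silently guessed at.
    """
    text = raw.strip().lower().replace("$", "").replace(",", "")
    multiplier = Decimal(1)
    if text.endswith("k"):  # assumption: a trailing "k" means thousands
        multiplier = Decimal(1000)
        text = text[:-1]
    try:
        return Decimal(text) * multiplier
    except InvalidOperation:
        return None


# Hypothetical entries for the same gift, typed four different ways
entries = ["$10k", "10,000", "$10000", "ten thousand"]
parsed = [parse_amount(entry) for entry in entries]
```

The first three entries come out as the same number, and the fourth comes back as `None` and needs a human to look at it. Every rule in this function is work that a formatted currency field would have made unnecessary.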

Any one of these issues puts your data at risk of being cast aside or deemed unusable. If you're investing resources in collecting it, it's worth taking the time to get it right.

What to do about it:

Start by looking critically at the cleanup you’re doing

Recurring cleanup is a signal the fix probably belongs at the source, not downstream.

Track the types of data quality issues you're fixing most often. Trace them back to their origin. Is it a formatting problem? A vague question? A field with no guardrails?
Note how much time you're spending mining free text responses for useful information. If it's a lot, that's a sign the collection method needs to change.

Review how you’re collecting data

For each key data point, ask: Are the questions and responses clear enough that anyone collecting or providing this data would do so consistently? Is the data returned easy to use for analysis? If the answer to either is no, that's your starting point. Keep an eye out for:

Unclear questions or response options: Clarify the definition, add or refine response options, or provide a rubric with clear definitions for each response (e.g. what does a 1, 5, or 10 actually mean in your context?)
Inconsistent data entry formats: If data should be entered as a specific type, like a date or currency amount, add guardrails to ensure consistency and prevent erroneous values. You can also set limits on the range of accepted values, like requiring dates to fall within a reasonable window or capping numeric entries at a realistic maximum (this is surprisingly easy to do, even in a Google Form!).
Underused or hard-to-use free text responses: Where answers tend to fall into predictable categories, replace the open text field with a dropdown or multiple choice question so data is captured consistently and needs less manual cleanup.
Questions you aren't using or can't capture reliably: Remove them. Every question adds burden to the respondent and noise to your dataset.
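If your collection tool can't enforce limits itself, the same guardrails can be sketched as a small check that runs before a record is saved. This is an illustration only: `validate_gift`, the `max_amount` default, and the rule that gift dates can't be in the future are assumptions I've made for the example, and you'd set your own thresholds.

```python
from datetime import date
from typing import List


def validate_gift(amount: float, gift_date: date, today: date,
                  max_amount: float = 100_000.0) -> List[str]:
    """Return a list of problems found; an empty list means the entry passes."""
    problems = []
    if amount <= 0:
        problems.append("amount must be positive")
    elif amount > max_amount:  # hypothetical cap; choose one that fits your data
        problems.append("amount exceeds realistic maximum")
    if gift_date > today:
        problems.append("date is in the future")
    return problems


today = date(2025, 9, 1)
ok = validate_gift(50, date(2025, 8, 10), today)
flagged = validate_gift(1_000_000, date(2045, 8, 10), today)
```

Here `ok` is an empty list, while `flagged` names both problems, so the entry can be corrected at the point of collection rather than discovered later in a report.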

If you know or suspect your data is incomplete or unrepresentative, determine the cause

A population or scenario is systematically being missed: Identify who or what isn't captured and why. Can your process be updated to include them? If so, that's your fix.
The data for your specific population isn't available: Some gaps can't be closed. Data from the Census’ American Community Survey (ACS), for example, can't always be filtered down to your exact population.

If the issue can’t be fixed, understand the implications rather than ignoring them. Does your data consistently underestimate or overestimate? Is the impact large or small? Are results still directionally useful? Sometimes just understanding the size and direction of the gap is enough to know what you can and can't responsibly conclude from the data.

Failure Point #2: Multiple Conflicting Sources | You have more than one version of the truth, and you're not sure which one to trust. This could happen at the very beginning (at collection), or at the end (in reporting or other outputs).

What this looks like:

Multiple systems are recording the same data

When the same data is being recorded in more than one place, it becomes unclear which one to believe.

Your public-facing donation platform and the accounting team's records show different totals for the same time period
Program enrollment is tracked in a CRM and also in a spreadsheet maintained by program staff, and the two never quite match.

Multiple teams are using the same data differently

When teams pull from the same source but apply different definitions, timeframes, or calculations, the numbers tell different stories, creating a lack of trust in the data and taking significant time to reconcile. Common culprits include:

Different definitions for the same metric: Two departments are both tracking "program participants" but one counts unique individuals and the other counts total enrollments. Both are defensible definitions, but they're not comparable, and presenting both without context creates confusion.
Timing mismatches: The development team reports on a calendar year, finance reports on a fiscal year, and program staff report on a grant cycle. The underlying data is the same, but the numbers look completely different depending on who's presenting them.
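The "unique individuals vs. total enrollments" mismatch is easy to see with a few made-up rows. In this sketch the participant IDs and program names are hypothetical.

```python
# Each row is one enrollment: (participant ID, program). Data is illustrative.
enrollments = [
    ("P001", "Job Training"),
    ("P001", "Financial Coaching"),
    ("P002", "Job Training"),
    ("P003", "Job Training"),
]

# Definition A: every enrollment counts, so P001 is counted twice.
total_enrollments = len(enrollments)

# Definition B: each person counts once, no matter how many programs they join.
unique_participants = len({participant_id for participant_id, _ in enrollments})
```

Same data, two honest answers: 4 "program participants" under one definition and 3 under the other. Neither team is wrong, but the numbers can't sit side by side without a label.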

Data lives in personal files, inboxes, and informal notes

When data is scattered rather than a shared system, data discrepancies are inevitable, and it’s a struggle to see the full picture.

Funder reporting lives in the development director’s email threads, program outcomes are in a spreadsheet only the program team uses, and costs are in a finance system nobody else can access.
Holly keeps her own copy of the program performance tracking data, which she updates manually every week. The program director pulls from the shared drive version, which hasn't been touched in months.

What to do about it:

Trace the data from the final outputs to the source to find where discrepancies are introduced

If the data comes from the same source but discrepancies come from different definitions or usage - Get the right people in a room and align on definitions, timeframes, and calculation methods. Determine whether the differences are intentional and necessary. If they're not, fix them. If they are, agree on them and document them.
If the issue comes from multiple sources - Decide which system is the trusted source. If only one system needs to exist, stop recording in the other. If both need to exist for different reasons, establish a clear process for keeping them in sync and agree on which one takes precedence when they conflict.
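When two systems hold the same records, a quick comparison shows where they disagree before anyone debates which to trust. A minimal sketch, assuming both sources can be exported with a shared participant ID; the IDs and statuses below are invented.

```python
# Hypothetical exports: participant ID -> program status in each system
crm = {"P001": "active", "P002": "active", "P003": "completed"}
spreadsheet = {"P001": "active", "P003": "active", "P004": "active"}

# Records that exist in one system but not the other
only_in_crm = sorted(crm.keys() - spreadsheet.keys())
only_in_spreadsheet = sorted(spreadsheet.keys() - crm.keys())

# Records in both systems whose values disagree: ID -> (CRM value, spreadsheet value)
conflicts = {
    pid: (crm[pid], spreadsheet[pid])
    for pid in sorted(crm.keys() & spreadsheet.keys())
    if crm[pid] != spreadsheet[pid]
}
```

In this example P002 is missing from the spreadsheet, P004 is missing from the CRM, and P003 has a different status in each. That short list is a much better starting point for the conversation than two totals that don't match.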

Establish a single source of truth for key data points

Decide which system(s) or dataset(s) are the trusted source for each type of data, and create a directory or shared repository for the data, so everyone knows where to go. Eliminate places where the same data is being independently recorded or maintained.

Document what you've agreed on

Once discrepancies are resolved and sources are centralized, create a data dictionary to address common questions and help make sure everyone is on the same page.

A data dictionary is one of the most useful tools for addressing many of these failure points. This is a central reference that documents your data sources, fields, definitions, known limitations, and any nuances that affect how the data should be interpreted. You can use this template to get you started.
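A data dictionary usually lives in a spreadsheet or shared document, but seeing one entry written out as a structure shows what it needs to hold. The sketch below is one possible shape, not the linked template: the class name `FieldEntry`, its attributes, and the sample entry are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FieldEntry:
    """One row of a data dictionary (hypothetical layout)."""
    name: str
    source: str                     # the trusted system for this field
    definition: str                 # the agreed-upon meaning
    allowed_values: Dict[str, str] = field(default_factory=dict)  # code -> meaning
    known_limitations: List[str] = field(default_factory=list)


data_dictionary = {
    "program_participants": FieldEntry(
        name="program_participants",
        source="CRM",
        definition="Unique individuals enrolled in at least one program "
                   "during the fiscal year.",
        known_limitations=["Counts people, not enrollments."],
    ),
}
```

Whatever format you use, the point is the same: the definition, the trusted source, and the caveats sit next to the field name where anyone can find them.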

Failure Point #3: Data Degrading After Collection | The data was fine when you collected it, but then something happened afterward.

What this looks like:

Data collection has changed over time

When questions, response options, or collection methods evolve, data from different time periods may no longer be directly comparable, even when it looks like it's measuring the same thing.

A survey question has been reworded, and while it seems similar, respondents may be interpreting it differently than they did before
Response formats have changed, like switching from ranges ("1-10 times") to exact numeric entry, making it impossible to compare historical and current responses directly
A program redefined what counts as a "completed session" midway through, and the data before and after the change reflects two different definitions.

Data is being manipulated post-collection

When data is updated, adjusted, or cleaned up after the fact without a clear record of what changed and why, it becomes difficult to know what you're working with or whether you can trust it.

Someone corrects what looks like an error but is actually a legitimate edge case
A volunteer or staff member edits responses to open text fields, standardizing language in a way that changes the meaning

Data is stale or out of date

Data that was accurate at the time of collection can become unreliable as circumstances change.

Contact information or situational data collected at intake never gets updated and doesn’t reflect current circumstances
Program status fields aren’t consistently updated, so records show participants as active who completed or left the program months ago

What to do about it:

Create continuity across different iterations of data collection where you can

When collection methods have changed over time, the goal is to make historical and current data as comparable as possible, without forcing comparisons that aren't valid.

Map fields that hold the same data across time periods - even if names or formats have changed. But be careful: only map fields that are genuinely measuring the same thing!
“Stack” the data where it's valid to do so. Bringing historical and current data together into one place makes it much easier to analyze trends over time. Fields that only exist pre- or post-change are fine to include. Just make sure it's clear which time periods they apply to.
Document nuances that apply to specific time periods in your data dictionary - what changed, when it changed, and what it means for how the data should be interpreted.
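Mapping and stacking can be sketched in a few lines. This example assumes two survey exports where some field names changed between years; the field names, the mapping, and the rows are hypothetical, and the mapping should only include fields you've confirmed measure the same thing.

```python
from typing import Dict, List

# Old field name -> current field name (only for genuinely equivalent fields)
FIELD_MAP_2023 = {"sessions_attended": "sessions", "zip": "zip_code"}


def harmonize(rows: List[dict], field_map: Dict[str, str], period: str) -> List[dict]:
    """Rename mapped fields and tag each row with its collection period."""
    harmonized = []
    for row in rows:
        renamed = {field_map.get(key, key): value for key, value in row.items()}
        renamed["collection_period"] = period
        harmonized.append(renamed)
    return harmonized


rows_2023 = [{"sessions_attended": 4, "zip": "55401"}]
rows_2024 = [{"sessions": 6, "zip_code": "55402", "referral_source": "friend"}]

# Stack both periods into one list using the current field names
stacked = harmonize(rows_2023, FIELD_MAP_2023, "2023") + harmonize(rows_2024, {}, "2024")
```

The 2023 row has no `referral_source` because the question didn't exist yet, and the `collection_period` tag makes that visible instead of leaving an unexplained blank.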

Establish processes to keep data current and to know when it isn't

Document in your data dictionary what triggers data to be updated (e.g. contact information verified at participant check-ins if older than 12 months), and how long the data is considered reliable. A mailing address might be trusted for a year or two, but not ten. Setting these rules up front makes it easier to flag stale records.
Create a process to update data that tends to become stale. If no update trigger exists for fields that change frequently, like contact information or program status, build a regular review or update trigger into your workflow to keep data up to date.
Add a date of last update where possible. A simple timestamp on key fields makes it immediately clear how fresh the data is and helps identify records that could be out-of-date.
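Once a last-updated date exists, flagging stale records is a simple comparison. In this sketch the shelf lives in `MAX_AGE` (two years for a mailing address, one year for a phone number) are assumptions for illustration; yours should come from the rules in your data dictionary.

```python
from datetime import date, timedelta

# Hypothetical shelf life per field
MAX_AGE = {
    "mailing_address": timedelta(days=730),
    "phone": timedelta(days=365),
}


def is_stale(field_name: str, last_updated: date, today: date) -> bool:
    """True when the field is older than its agreed shelf life."""
    return today - last_updated > MAX_AGE[field_name]


today = date(2025, 6, 1)
phone_stale = is_stale("phone", date(2024, 1, 1), today)
address_stale = is_stale("mailing_address", date(2024, 1, 1), today)
```

Both fields were last updated on the same day, but only the phone number is flagged, because it has the shorter shelf life.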

Establish controls around who can access and edit your data, and create an audit trail for changes that are made

Access controls: Not everyone who needs to view data needs to be able to edit it. Define who can view, who can edit, and who can delete, and set permissions accordingly.
Clear guidelines for data adjustments: When data does need to be updated or corrected after collection, make sure you have a process to document why the change was made, who made it, and what the original value was.
Audit trails: Where possible, use tools that automatically track changes, including what was changed, when, and by whom. If your tool doesn't support this natively, a simple change log maintained alongside the data is better than nothing.
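If your tool doesn't track changes for you, a change log can be as simple as recording the old value before overwriting it. A minimal sketch, with `apply_change` and the record layout as hypothetical names:

```python
from datetime import datetime, timezone

change_log = []


def apply_change(record: dict, field_name: str, new_value,
                 changed_by: str, reason: str) -> None:
    """Log what changed, who changed it, and why, then update the record."""
    change_log.append({
        "record_id": record["id"],
        "field": field_name,
        "old_value": record.get(field_name),  # original value is preserved here
        "new_value": new_value,
        "changed_by": changed_by,
        "reason": reason,
        "changed_at": datetime.now(timezone.utc).isoformat(),
    })
    record[field_name] = new_value


participant = {"id": "P001", "status": "active"}
apply_change(participant, "status", "completed",
             changed_by="holly", reason="Finished program in May")
```

The record now shows the new status, and the log still holds the original value, so a "correction" that turns out to be wrong can be traced and reversed.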

Failure Point #4: Data Systems & Structures | Your data exists, but it's organized in a way that makes it hard to connect, compare, or use.

What this looks like:

Your data structure makes analysis difficult

Data is organized in a way that doesn't easily allow you to use the data, or creates a lot of extra work.

All of your program data (participant information, session details, outcomes, and contact info) lives in one massive file, making it nearly impossible to find what you need or run any meaningful analysis
Your CRM stores donor information, volunteer hours, and program participation all in the same place in a way that makes it hard to understand any one of them clearly

Your data can't be connected across systems or tables

You have the pieces you need, but no way to put them together.

One system has person-level data, and another has activity-level data, but you need to put them together to be able to see the full picture.
Records don’t have a unique identifier - Is John Smith in one system the same as John W. Smith in another?

The tool doesn't match the data structure

Using a donor CRM to track program participants often creates a “square peg in a round hole” situation, so data is tough to work with and doesn’t clearly reflect how your program works.

Important context lives outside the data

The numbers are there but the meaning is in someone's head or in a separate document. You have attendance counts but the key to what each location code means is in a spreadsheet only one person knows about.

What to do about it:

Reassess whether your tools are working for you

The tool you have may not be the right fit for the data you're managing, and that mismatch creates ongoing friction that compounds over time. If you're using a donor CRM to track program participants, or forcing data into a system that wasn't designed for it, it's worth asking whether a different tool would serve you better.

Improve how your data is structured and connected

Even within your existing tools, how data is organized has a big impact on how usable it is.

Create unique identifiers for key data objects. If records across systems or tables can't be connected, a consistent ID assigned to each person, case, or record is often the simplest fix.
Restructure within your existing tools where you can. Sometimes the fix is as simple as reorganizing a spreadsheet, splitting one unwieldy table into logical, connectable pieces, or standardizing how fields are named and formatted. Before assuming you need something new, look at whether what you have can be reorganized to work better.
Lightweight tools can make a big difference without an overhaul. You don't necessarily need to replace your existing systems to get more out of your data. Targeted, lightweight solutions that sit alongside your existing tools and workflows can clean up, combine, and organize data across various sources in ways that make it far more usable. Check out this case study for a great example.
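Here is what a unique identifier buys you in practice: once person-level and activity-level data share an ID, connecting them takes a few lines. The IDs, names, and session dates below are invented for illustration.

```python
from collections import defaultdict

# Person-level table, keyed by a unique ID rather than by name
people = {
    "P001": {"name": "John Smith"},
    "P002": {"name": "John W. Smith"},
}

# Activity-level table: one row per session attended
activities = [
    {"person_id": "P001", "session": "2025-03-01"},
    {"person_id": "P001", "session": "2025-03-08"},
    {"person_id": "P002", "session": "2025-03-01"},
]

# Group sessions by person ID
sessions_by_person = defaultdict(list)
for activity in activities:
    sessions_by_person[activity["person_id"]].append(activity["session"])

# Combine the two tables into one view per person
summary = {
    pid: {"name": person["name"], "session_count": len(sessions_by_person[pid])}
    for pid, person in people.items()
}
```

With IDs in place, there's no guessing whether John Smith and John W. Smith are the same person: they're two records with two separate attendance counts.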

Use a data dictionary and shared repositories to capture and share key context

Numbers without context are hard to trust and harder to use. Capturing that context in a shared, accessible place is what turns a dataset into something a whole team can confidently work with.

Document field definitions, codes, and abbreviations in your data dictionary.
Store reference documents, value keys, and supporting materials in a shared location that anyone on the team can access.

Bad data doesn't happen magically.

Unfortunately, it doesn't get fixed magically either.

But now you know where to look. Whether your issues start at the source, show up in conflicting versions, creep in after collection, or live in a system that was never quite set up right, there's a point of failure behind it, and a path to fixing it.

Abracadabra. 🪄

Beth Hegland

Impactful Insights Data Partners