Parsing data is nicer than only validating it

Published in JavaScript, Joi, TypeScript and Zod


Parsing encompasses validation, plus it can increase type-safety, plus it can transform the data to a better format. Parse, don't just validate.


Validation

Validating data means checking if the data matches a certain shape or certain rules.

For example, when using Joi, a JavaScript validation library, you write a schema and check if some data matches the schema:

import Joi from 'joi'

const userSchema = Joi.object({
  createdOn: Joi.string().required().isoDate(),
  userName: Joi.string().required().min(2),
  userType: Joi.number().required().integer().min(1).max(3),
})

const response = await fetch('https://example.com/api/users/123')
const data: unknown = await response.json()

const validationResult = userSchema.validate(data)

if (validationResult.error) {
  console.error(validationResult.error)
} else {
  const user = validationResult.value
  //    ^? any
}

Validation is essentially a binary check: the data either is valid or it isn't.

Well, you do get descriptive validation error messages when using Joi.

But if the validation passes, you are none the wiser about the validated data, at least in terms of type-safety: the original data variable still has the type unknown, and validationResult.value has the type any.

While validating data is certainly useful, the result becomes even more useful when you parse the data instead of only validating it.

Parsing

Parsing means processing some input and converting it to a more precise output.

Parsing does validation plus more: parsing = validating data + increasing type-safety + transforming data.

Parsing encompasses validation

Validation is a more-or-less implicit step of parsing: if the input data isn't valid (from the parser's perspective), the parsing can fail.

Examples:

  • JSON.parse throws a SyntaxError if the input string isn't valid JSON.
  • The URL constructor (new URL(...)) throws a TypeError if the input isn't a valid URL.
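
For example, here's the JSON.parse case (a minimal sketch; the config name is made up):

try {
  const config = JSON.parse('{ oops, not JSON }') // throws a SyntaxError
  console.log(config)
} catch (error) {
  console.error('Parsing failed:', error)
}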

Parsing can increase type-safety

In the Joi example above, Joi could in theory know the result type.

For example, if an object schema contains a property with the value of Joi.string(), Joi could infer the type of that property to be string | undefined.[1]

But Joi doesn't do that. It throws away the potential type knowledge acquired during validation.

Example: Joi vs Zod

One alternative to Joi is Zod, a TypeScript validation library.

On the surface, Zod is used similarly to Joi, but the critical difference is that Zod can infer the type of the parsed data from the schema:

-import Joi from 'joi'
-
-const userSchema = Joi.object({
-  createdOn: Joi.string().required().isoDate(),
-  userName: Joi.string().required().min(2),
-  userType: Joi.number().required().integer().min(1).max(3),
-})

+import { z } from 'zod'
+
+const userSchema = z.object({
+  createdOn: z.string().datetime(),
+  userName: z.string().min(2),
+  userType: z.number().int().min(1).max(3),
+})
+type User = z.infer<typeof userSchema>
+//   ^? {
+//        createdOn: string
+//        userName: string
+//        userType: number
+//      }

 const response = await fetch('https://example.com/api/users/123')
 const data: unknown = await response.json()

-const validationResult = userSchema.validate(data)
-
-if (validationResult.error) {
-  console.error(validationResult.error)
-} else {
-  const user = validationResult.value
-  //    ^? any
-}

+const parsingResult = userSchema.safeParse(data)
+
+if (parsingResult.success) {
+  const user = parsingResult.data
+  //    ^? User
+} else {
+  console.error(parsingResult.error)
+}

The user variable, which is the result of successful parsing, now has the type User, inferred from the Zod schema.

What about JSON.parse?

I used JSON.parse as an example above when talking about the validation step of parsing. But hmm, why doesn't JSON.parse increase any type-safety...?

Maybe JSON.parse should be thought of as a very high-level parser. The TypeScript compiler has no visibility into things happening at runtime, like what's passed to JSON.parse. If you want type-safety, you need to be more specific about what you ask for, e.g. by using a type-aware runtime parser like Zod.
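
Here's what that looks like in practice (a small sketch; userNameSchema is a made-up name):

import { z } from 'zod'

const raw = JSON.parse('{"userName": "Ada"}')
//    ^? any (the compiler can't know what the JSON string contains)

// A type-aware runtime parser narrows the data to a concrete type:
const userNameSchema = z.object({ userName: z.string() })
const result = userNameSchema.safeParse(raw)

if (result.success) {
  const { userName } = result.data
  //      ^? string
}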

Parsing can transform data

So far I have talked only about validation and type-safety. What really separates parsing from validation (type-safe or not) is data transformations (or "data mapping" or "data reserialization").

Transforming data means reshaping the data into a better format. Not by mutating the input data[2], but by returning a different shape than what was passed to the parser.

What a "better format" means is up to you. For example, you can:

  • Trim strings.
  • Convert values to different types, e.g. numerical strings to numbers, or date strings to Date objects.
  • Set default values for optional or nullable fields.
  • Transform object keys from PascalCase or snake_case (neither are JavaScript-y) to camelCase.
  • Rename object keys from a foreign language to English.
  • Rename object keys to words that are already used in your code base to avoid having multiple words for the same thing (data coherency); e.g. userId vs personId.
  • Compute/derive values from other values; e.g. construct fullName by combining firstName and lastName.
  • Simplify object structures, e.g. change Array<{ key: string; value: number }> to Record<string, number>.
  • Convert comma-separated strings to arrays, e.g. "1,2,3" to [1, 2, 3].
  • Convert number unions to string unions (see the userSchema example below).

The point is to modify the data so that it looks like you'd want it to look.
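
For instance, here are two of the transformations above (trimming, and converting a comma-separated string to an array) as a minimal Zod sketch (idsSchema is a made-up name, and .trim() assumes a reasonably recent version of Zod):

import { z } from 'zod'

// Trim the raw string, then split it into an array of numbers.
const idsSchema = z
  .string()
  .trim()
  .transform((value) => value.split(',').map(Number))

const ids = idsSchema.parse(' 1,2,3 ')
//    ^? number[], i.e. [1, 2, 3]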

It's not fun to work with messy data, but if you transform the data to a nicer format right after receiving it:

  • You need to deal with the messy data only once.
  • The transformations are centralized to one place instead of being scattered around everywhere.
  • You can write business logic against the ideally shaped data (as opposed to writing business logic against "someone else's data").

Example: userSchema

In the Joi vs Zod example above, userSchema and the inferred type look like this:

import { z } from 'zod'

const userSchema = z.object({
  createdOn: z.string().datetime(),
  userName: z.string().min(2),
  userType: z.number().int().min(1).max(3),
})
type User = z.infer<typeof userSchema>
//   ^? {
//        createdOn: string
//        userName: string
//        userType: number
//      }

The User type is quite messy. If the data is coming from an external API, the data itself can't be changed... but the parsed data can be improved by:

  • Converting the createdOn string to a Date object, so you don't need to repeatedly do a conversion from a date string to a Date object. Date objects are nicer to work with than date strings.
  • Converting the userType number to a string union of 'admin' | 'moderator' | 'user'. A string union like that is much clearer, plus then you don't need to repeatedly do a conversion from a number to a string.
  • Renaming userName (uppercase N) to username (lowercase n) because "username" is often spelled as a single word.
  • Renaming userType to role because userType sounds redundant, and type would sound a bit too generic.

Like so:

type User = {
  createdOn: Date
  role: 'admin' | 'moderator' | 'user'
  username: string
}

type ParseError = { issues: string[] } // the error shape is up to you

function parseUser(data: unknown): User | ParseError {
  // Validate and transform `data` (see the sketch below)
}

How you transform the data is an implementation detail. Zod has means for transforming data (another blog post coming later); you could also transform the data manually.
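
To give an idea, here's the userSchema from above rewritten with Zod's .transform() (a sketch: the mapping from userType numbers to role names is made up for this example, and userType is narrowed to literal values so the role lookup type-checks):

import { z } from 'zod'

// Hypothetical mapping: 1 = admin, 2 = moderator, 3 = user.
const roleByUserType = { 1: 'admin', 2: 'moderator', 3: 'user' } as const

const userSchema = z
  .object({
    createdOn: z.string().datetime(),
    userName: z.string().min(2),
    userType: z.union([z.literal(1), z.literal(2), z.literal(3)]),
  })
  .transform((data) => ({
    createdOn: new Date(data.createdOn),
    role: roleByUserType[data.userType],
    username: data.userName,
  }))

type User = z.infer<typeof userSchema>
//   ^? {
//        createdOn: Date
//        role: 'admin' | 'moderator' | 'user'
//        username: string
//      }

A successful safeParse now returns the transformed shape directly, so validation and transformation happen in one pass.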

Fuzzy terminology

What is parsing? What is validation?

Above I said that parsing = validating data + increasing type-safety + transforming data. This is based on other sources (see further resources below).

If the data is not transformed, I guess it could still be called just validation; be it "type-ignorant" validation (think of Joi) or type-aware validation (think of Zod).

In fact, Joi and Zod are both called validation libraries even though they also support data transformations. But that's also a bit confusing.

Practical guidelines

  • If your intention is to transform the data, make sure you call it parsing. Calling it validation would be confusing.
  • If your intention is to do only validation:
    • Call it validation or parsing; choose whichever term and stick to it (be consistent). Calling it validation would distinguish it from parsing that transforms data, though I don't know if that would be valuable.
    • If you call it validation, make sure you don't transform the data (accidentally or deliberately); it would be confusing.

Further resources

My other blog posts on this topic:

More coming later; I like this topic.

Main inspiration

This blog post is largely inspired by the Parse, don't validate blog post by Alexis King. Her post has examples in Haskell, which I'm not familiar with, but the post was still easy to read.

These snippets resonated with me the most:

  • Validation vs parsing:

    [I]n my mind, the difference between validation and parsing lies almost entirely in how information is preserved. [...] Both of [them] check the same thing, but [parsing] gives the caller access to the information it learned, while [validating] just throws it away.

  • What is parsing:

    Consider: what is a parser? Really, a parser is just a function that consumes less-structured input and produces more-structured output. [...] Under this flexible definition, parsers are an incredibly powerful tool: they allow discharging checks on input up-front, right on the boundary between a program and the outside world, and once those checks have been performed, they never need to be checked again!

  • Beware "shotgun parsing":

    Ad-hoc validation leads to a phenomenon that the language-theoretic security field calls shotgun parsing. [...]

    Shotgun parsing is a programming antipattern whereby parsing and input-validating code is mixed with and spread across processing code [...].

    [...] In other words, a program that does not parse all of its input up front runs the risk of acting upon a valid portion of the input, discovering a different portion is invalid, and suddenly needing to roll back whatever modifications it already executed in order to maintain consistency. Sometimes this is possible – such as rolling back a transaction in an RDBMS – but in general it may not be.

    It may not be immediately apparent what shotgun parsing has to do with validation – after all, if you do all your validation up front, you mitigate the risk of shotgun parsing. The problem is that validation-based approaches make it extremely difficult or impossible to determine if everything was actually validated up front or if some of those so-called "impossible" cases might actually happen. The entire program must assume that raising an exception anywhere is not only possible, it's regularly necessary.

    Parsing avoids this problem by stratifying the program into two phases – parsing and execution – where failure due to invalid input can only happen in the first phase. The set of remaining failure modes during execution is minimal by comparison, and they can be handled with the tender care they require.

  • Guidelines for parsing – the blog post contains lots of great points; these few resonated with me the most:
    • Get your data into the most precise representation you need as quickly as you can. Ideally, this should happen at the boundary of your system, before any of the data is acted upon.

      If one particular code branch eventually requires a more precise representation of a piece of data, parse the data into the more precise representation as soon as the branch is selected.

    • [W]rite functions on the data representation you wish you had, not the data representation you are given. The design process then becomes an exercise in bridging the gap, often by working from both ends until they meet somewhere in the middle.

    • Don't be afraid to parse data in multiple passes. Avoiding shotgun parsing just means you shouldn't act on the input data before it's fully parsed, not that you can't use some of the input data to decide how to parse other input data. Plenty of useful parsers are context-sensitive.

After reading Alexis King's blog post, I wasn't sure of the relationship between parsing and validation. These Hacker News (HN) comments helped me realize that "parse, don't validate" doesn't mean "parse data instead of validating it"; it means "parse data instead of only validating it":

  • An HN comment by lexi-lambda (Alexis King):

    The idea really is that parsing subsumes validation. If you're parsing, you don't have to validate, because validation is a natural side-effect of the parsing process. And indeed, I think that's the very thesis of the blog post: parsing is validation, just without throwing away all the information you learned once you've finished!

  • An excerpt from an HN comment by danharaj:

    The point is not that "parsing" and "validating" are distinct concepts, but that they're two different points of view on the same problem.

  • An HN comment by b3morales:

    You seem to suggest that it's possible to parse without validating, which I'm not sure I follow. Surely validation is just one of the phases or steps of parsing?

  • An HN comment by nsajko:

    The point is that validation is (or should/can be) a byproduct of parsing. I.e., you shouldn't "do both", rather the validation should be encompassed by the parsing, as much as it makes sense.

  • An excerpt from an HN comment by friendzis:

    Validation happens during parsing implicitly: parser will either return a valid object or throw an error, but parsing has an added benefit that the end result is a datum of a known datatype.

Other sources

Other helpful things I have stumbled upon, with a few grammatical fixes:

  • An excerpt from a Lobsters comment by kevincox:

    When accepting untrusted input you should not validate it, and accept or reject the user input. You should instead parse and reserialize the input. This ensures that you will only have data that you wrote yourself. It is also natural to throw away unknown fields and simplify any odd formatting.

  • An excerpt from a Lobsters comment by jbert:

    Is [the "Parse, don't validate" idea] fundamentally the same idea as "make invalid states unrepresentable"?

    I.e. don't have a corner case of your type where you could have a value which isn't valid (like the empty list), instead tighten your type so that every possible value of the type is valid.

    Looked at this way, any time you have a gap (where your type is too loose and allows invalid states) the consumer of the value of that type needs to check them again. This burden is felt everywhere the type is used.

    If you do the check to ensure you haven't fallen into the gap, take the opportunity to tighten the type to get rid of the gap. I.e. make the invalid state unrepresentable.

  • An excerpt from a Lobsters comment by Axman6:

    As a Haskell developer, I've spent a lot of time looking at unit test suites for other languages and thinking "why are you testing that? your type system should make that impossible", and coming to the realisation that for many languages, their test suite is their type system – but it is an incomplete, sometimes buggy one limited by what the developer was able to predict could go wrong.

    I've always tried to make programs where only valid business logic is possible, and doing this relies heavily on the use of sum types to be precise about what is allowable – I was talking earlier today to someone on IRC about the difference between validating (a.k.a writing code that works with [booleans] to gate progress to other code) and parsing (writing code that produces valid values – perfectly summed up in Alexis King's Parse, don't validate).

    At a previous job, we did this religiously, data coming into the program was parsed (in this case from SQS queues), such that we knew in the rest of the system that all our preconditions had been satisfied, and only valid data was allowed within the inner shell of the app. [...]

    The rest of the program was basically implemented in a way that it was very difficult to write incorrect code, all alternatives were represented as subtypes [...].

  • An excerpt from Parse, don't validate, incoming data in TypeScript by Elias Nygren:

    The "parse, don't validate" mantra is all about parsing incoming data to a specific type, or failing in a controlled manner if the parsing fails. It's about using trusted, safe and typed data structures inside your code and making sure all incoming data is handled at the very edges of your programs. Don't pass incoming data deep into your code, parse it right away and fail fast if needed.

    Parsing is better than just validation because parsing forces you to explicitly handle all incoming data. It gives you a type safe way [of] working and makes it hard to pass around malicious content around your applications and data stores. However, it is true that parsing often includes validating the data.

  • Excerpts from Can types replace validation? by Mark Seemann:
    • The naive take on validation is to answer the question: Is that data valid or invalid? Notice the binary nature of the question. It's either-or.

    • This changes focus on validation. No longer is validation a true/false question. Validation is a function from less-structured data to more-structured data. Parse, don't validate.

      • The "This" at the beginning refers to "applicative validation" (I don't know what that is), but I think it could just as well refer to parsing (i.e. "Parsing changes focus on validation"); the quote would still make sense.
  • An excerpt from Shotgun Semantics by Donny Winston:

    Developers often resort to shotgun parsing: scattering data checks and fallback values in various places throughout the system's main logic.

    The habit of scattering parser-like behaviour throughout an application's code and the resulting inconsistencies in data handling can often lead not just to annoying complications and bugs, but also security vulnerabilities.

  • An excerpt from a deleted user's comment on an /r/typescript post Has your team adopted TypeScript? How is it going?:

    I joined a team 6 months ago that had recently adopted TypeScript to a fairly new (2 years old) project. [...]

    People didn't really know why they were using it. They would create interfaces for objects coming over APIs (good), but would make them have slightly differently named fields, like mainId would be named main_id in the interface (bad). This resulted in pointless chunks of code to "convert" the API object to the front-end interface.

    • A reply by monnef:

      In some languages (like Haskell) this is the recommended approach – separate types for API and rest of the application. When the API changes, you don't have to change all code which works with the type, just the convert functions. Even in TS, I really don't like reusing API types (generated in our case) for forms and a few other places, because more often than not the types aren't actually the same (e.g. null vs undefined, number vs string and so on).

    • Another reply by pm_me_ur_happy_traiI:

      I actually encourage my team to do exactly this. Back-end APIs are optimized to be communicated over the wire. They need to be turned into something that works with front-end conventions and suits the needs of our app.

      We build our front-ends using data structures that make sense to the front-end, and there's usually a function to turn the back-end API response into that. The added upside is that if the back-end changes, we just need to fix the one function as opposed to renaming properties all over our code base.

  • An excerpt from Parsing REST API Responses in TypeScript using Zod by Matt Newhall:

    The ["parse, don't validate"] concept boils down to the underlying assumption in validation that a type can have valid and invalid instances. Let's take an example of validating a type, T:

    validateType: (T) => boolean
    

    In this case, validateType takes a type T and then by evaluating it returns a boolean stating if the incoming object was a valid object of type T or not. However, this means that invalid instances of T can exist. Alternatively, using parsing:

    parseType: (unknown) => T | Error
    

    We can take some data of an unknown type, and check if this matches what is expected of type T. From this, we can output a guaranteed type T, or handle the error in the event that it is not. This makes the schema checking system more TypeScript-esque, making it more akin to a type guard than blindly assuming a variable's type.

    • An interesting point that I hadn't considered: "[Validating instead of parsing] means that invalid instances of T can exist." On the other hand, shouldn't validateType take an unknown param, not a T param? Hmm...
  • An excerpt from Pushing Side Effects to the Side by Eric Weise:

    Functional programming provides a treasure trove of useful patterns, one of which is pushing side effects to the edges of your application. Side effects are functions that have non-deterministic behavior. Examples include making an HTTP call or persisting data to the database. [...] By isolating these types of calls, we can more easily test our code. [...]

    How many methods could you turn into simple pure function calls simply by removing side effect calls such as repositories from your business logic? Most projects I have worked on just sprinkle database calls throughout [the] service layer.

    In general the pattern I try to follow is:

    • Retrieve all the data necessary to perform the business logic (side effect code).
    • Perform the business logic using in-memory data (pure functions).
    • Persist changes to the database (side effect code).

    These three steps can't be followed every time but when they can, the code will be much easier to test.

    • I think "pushing side effects to the side" doesn't benefit only testing, so "the code will be much easier to test" could also read "the code will be much clearer overall."
  • Slogan of the parse-dont-validate TypeScript library:

    Validating data acts as a gatekeeper, parsing them into meaningful data types adds valuable information to raw data.

  • A tweet by Mateusz Kwaśniewski: Parse, don't validate – TypeScript edition
    • Features Joi and Zod, which inspired me to use them in this blog post.
  • A GitHub discussion under the Zod project: Confusing terms – parse vs validate


Footnotes

  1. The corresponding type of Joi.string() would be string | undefined, because in Joi, all values are optional by default. I find it a bit confusing. The type of Joi.string().required() would be string.

  2. I wonder if there's ever a good reason for a parser to mutate the input data.