4 Cute Python Functions for Working with Dirty Data

These functions do a lot of heavy lifting

Image shared under CC-BY-SA-4.0 by 1840151sudarshan at Wikimedia Commons


Contents

  • Coalesce
  • Safe Get
  • Dig
  • Safe Cast
  • Conclusion

Coalesce

This function comes in handy when several candidate values could be assigned to a variable and there is a known order of preference among them: we want the most preferred value that is actually available.
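A minimal sketch of such a function, assuming it takes its candidates as positional arguments in order of preference and treats only None as "not available":

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if every one is."""
    return next((value for value in values if value is not None), None)


# Candidates are listed from most to least preferred.
print(coalesce(None, "medium.png", "small.png"))  # medium.png
print(coalesce(None, None))                       # None
```

Because the check is `is not None` rather than a truthiness test, values such as 0 and the empty string count as present.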

If you have been writing SQL queries of at least moderate complexity, the purpose of a coalesce function may already be clear to you. But for those who are unfamiliar with it, a motivating example may be instructive.

Let’s say we’re working with a subset of the response data for a story retrieved from Medium’s Stories API, and we want to extract a link to the featured image for a story. However, we would prefer to extract the link for the image with the highest available resolution.

story = {
   "image": {
       "image_url_large": "https://cdn.images.site/large.png",
       "image_url_medium": "https://cdn.images.site/medium.png",
       "image_url_small": "https://cdn.images.site/small.png"
   },
}

In this case, the following code might suffice.

featured_image_url = story["image"]["image_url_large"]

But imagine that the API sometimes returns no image URLs at all (e.g. for stories that do not include any images), or that the URL for each image size is present only some of the time.

We could write a series of if statements and use dict.get to safely extract a URL. However, our coalesce function, which returns its first non-None argument, lets us extract the featured image URL quite simply and safely.

featured_image_url = None
if story.get("image"):
   featured_image_url = coalesce(
       story.get("image").get("image_url_large"),
       story.get("image").get("image_url_medium"),
       story.get("image").get("image_url_small")
   )

We could even forgo the presence check for the image field if the following is more to our liking. It’s certainly more appealing to look at.

featured_image_url = coalesce(
   story.get("image", {}).get("image_url_large"),
   story.get("image", {}).get("image_url_medium"),
   story.get("image", {}).get("image_url_small")
)

It is worth noting that a similar, but not equivalent, behavior is achievable by using Python’s or operator and relying on Python’s truthiness semantics.

featured_image_url = (
   story.get("image", {}).get("image_url_large") or
   story.get("image", {}).get("image_url_medium") or
   story.get("image", {}).get("image_url_small")
)

This would produce the correct result for our example, but it would fail in a situation where an empty string or zero is a valid value that could be extracted. That is because the empty string and zero are falsy in Python, so the or operator skips over them.

In that case, we would need a function like our coalesce that is stricter about what values it considers to mean “nothing,” which is only the None value.
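The difference can be seen in a small sketch. The stats dict and its keys are hypothetical, chosen only so that zero is a legitimate value, and coalesce is re-declared here (as assumed in this article's description) so the snippet runs on its own:

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if every one is."""
    return next((value for value in values if value is not None), None)


# Hypothetical data: a story that genuinely has zero claps.
stats = {"claps": 0, "estimated_claps": 25}

with_or = stats.get("claps") or stats.get("estimated_claps")
with_coalesce = coalesce(stats.get("claps"), stats.get("estimated_claps"))

print(with_or)        # 25, because 0 is falsy and gets skipped
print(with_coalesce)  # 0, because only None counts as missing
```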

Safe Get

When working with JSON-style data, we extract values from the dict and list data structures as naturally as the sun rises and sets each day.

This means it can be an inconvenience not having a unified interface through which we can extract values from both types of collections.

The safe_get function beautifully provides that unified interface while also freeing us from wrapping our extraction code with error handling logic since the safe_get function simply returns None (or a provided default) if a value is not found.
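A minimal sketch of how safe_get might be written, assuming the signature safe_get(collection, key, default=None) implied by the calls in this article, and assuming that a failed lookup means a KeyError, IndexError, or TypeError:

```python
def safe_get(collection, key, default=None):
    """Return collection[key], or default if the lookup fails."""
    try:
        return collection[key]
    except (KeyError, IndexError, TypeError):
        # KeyError: missing dict key; IndexError: list index out of range;
        # TypeError: e.g. collection is None or the key type does not fit.
        return default


print(safe_get({"title": "Dig"}, "title"))      # Dig
print(safe_get({"title": "Dig"}, "body", ""))   # prints an empty line
print(safe_get(["a", "b", "c"], 2))             # c
print(safe_get(["a", "b", "c"], 5))             # None
print(safe_get(None, "title"))                  # None
```

Catching TypeError is a design choice in this sketch: it lets calls be chained without checking whether an intermediate value is None.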

With our implementation of safe_get, we could update our code for extracting the URL of the highest resolution image for a story’s featured image to be more functional in style.

featured_image_url = coalesce(
   safe_get(safe_get(story, "image", {}), "image_url_large"),
   safe_get(safe_get(story, "image", {}), "image_url_medium"),
   safe_get(safe_get(story, "image", {}), "image_url_small")
)

As previously mentioned though, we also get a unified interface for extracting data from both dict and list data structures. So let’s demonstrate that.

Let’s again assume the data below is from Medium’s Stories API.

featured_stories = [
   {"title": "Python: One Problem, Several Lessons",
    "description": "...",
    "body": "...",
    "image": {},
    "stats": {},},
   {"title": "Improving Code Quality in Python Codebases",
    "description": "...",
    "body": "...",
    "image": {},
    "stats": {},},
   {"title": "How to recursively reverse a linked list",
    "description": "...",
    "body": "...",
    "image": {},
    "stats": {},},
]

With this data, we’re interested in extracting the third featured story for a given Medium publication or user. Why? Because the saying “third time’s the charm” is one of our operating principles.

However, not every publication or user will have three or more stories from which we can pick the third one. But we don’t mind. We can optimistically write our extraction code using our safe_get function.

third_featured_story = safe_get(featured_stories, 2) # 0-index 👨🏿‍🏫

If there is no third featured story, we’ll get back None from safe_get and be on our way.

Another great use for safe_get is as a building block for other functions, such as the dig function discussed next.

Dig

The dig function can be used to extract values from potentially nested dict and list data structures. It’s quite useful in data extraction and transformation because we often need to flatten nested data so that we can store it as structured data in a database.

I have spent the majority of my time in data engineering (so far) using the Ruby programming language. Ruby has the Array#dig and Hash#dig methods, which make it quite easy to pick values from potentially nested Array and Hash data structures in which values could be missing. However, this functionality is not natively available in Python, so we have to build it ourselves.

Side note: A Python list corresponds to a Ruby Array, and a Python dict corresponds to a Ruby Hash.

Here is a reminder of what the implementation of our safe_get function looks like. We will be using it as a helper function in our implementation of dig which follows.

Now here is our implementation of the dig function using safe_get as a helper function.
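A minimal sketch of both functions, under stated assumptions: safe_get has the signature safe_get(collection, key, default=None), dig takes the collection followed by any number of keys or indexes (as in the calls in this article), and the keyword-only default parameter on dig is an addition made here for illustration.

```python
def safe_get(collection, key, default=None):
    """Return collection[key], or default if the lookup fails."""
    try:
        return collection[key]
    except (KeyError, IndexError, TypeError):
        return default


def dig(collection, *keys, default=None):
    """Follow keys/indexes into nested dicts and lists, one level at a time.

    Returns default as soon as any step is missing or yields None.
    """
    value = collection
    for key in keys:
        value = safe_get(value, key)
        if value is None:
            return default
    return value


data = {"image": {"sizes": [{"url": "https://cdn.images.site/large.png"}]}}

print(dig(data, "image", "sizes", 0, "url"))             # the large.png URL
print(dig(data, "image", "sizes", 3, "url"))             # None
print(dig(data, "author", "name", default="unknown"))    # unknown
```

Note that this sketch treats a value that is present but None the same as a missing one, which is usually what we want when flattening dirty data.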

We can once again update our code for extracting the URL of the highest resolution image for a story’s featured image using our dig function.

featured_image_url = coalesce(
   dig(story, "image", "image_url_large"),
   dig(story, "image", "image_url_medium"),
   dig(story, "image", "image_url_small")
)

But let us go a bit further. Take the following setup for instance. We have nested dict and list data, and as indicated by the types on the attributes of the MediumStory class, there are some attributes (marked as Optional) for which we do not always expect a value to be present.

Now, let’s write some code to extract and flatten this nested story data into our Python class.
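The setup and the extraction code might look like the sketch below. Everything specific in it is hypothetical: the attributes of MediumStory, the shape of raw_story, and the field names author, tags, and stats are assumptions made for illustration, and the helper functions are re-declared so the snippet runs on its own.

```python
from dataclasses import dataclass
from typing import Optional


def coalesce(*values):
    return next((value for value in values if value is not None), None)


def safe_get(collection, key, default=None):
    try:
        return collection[key]
    except (KeyError, IndexError, TypeError):
        return default


def dig(collection, *keys, default=None):
    value = collection
    for key in keys:
        value = safe_get(value, key)
        if value is None:
            return default
    return value


@dataclass
class MediumStory:
    # Hypothetical flat record; Optional marks values that may be absent.
    title: str
    author_name: Optional[str]
    featured_image_url: Optional[str]
    first_tag: Optional[str]
    clap_count: Optional[int]


# Hypothetical nested response with several gaps in it.
raw_story = {
    "title": "Improving Code Quality in Python Codebases",
    "author": {"name": None},
    "image": {"image_url_medium": "https://cdn.images.site/medium.png"},
    "tags": ["python", "code-quality"],
    "stats": {},
}

flat_story = MediumStory(
    title=dig(raw_story, "title", default=""),
    author_name=dig(raw_story, "author", "name"),
    featured_image_url=coalesce(
        dig(raw_story, "image", "image_url_large"),
        dig(raw_story, "image", "image_url_medium"),
        dig(raw_story, "image", "image_url_small"),
    ),
    first_tag=dig(raw_story, "tags", 0),
    clap_count=dig(raw_story, "stats", "claps"),
)

print(flat_story)
```

None of the lookups needs its own presence check or try/except: missing pieces simply come back as None and land in the Optional attributes.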

Safe Cast

This function hardly needs a motivating example. It’s useful when we would prefer getting None (or some default value we set) if we fail to convert a value from one type to another.

One example that readily comes to mind is the conversion of a string value to a numeric value. Such a conversion would typically raise an exception if an unacceptable input such as an empty string is provided, but with safe_cast we can suppress the exception and specify a default value if necessary. Examine the following code for how safe_cast might come in handy.
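A minimal sketch of safe_cast and its use, assuming the signature safe_cast(value, to_type, default=None) and assuming that a failed conversion surfaces as a ValueError or TypeError (which is what int and float raise for bad strings and for None):

```python
def safe_cast(value, to_type, default=None):
    """Return to_type(value), or default if the conversion fails."""
    try:
        return to_type(value)
    except (ValueError, TypeError):
        return default


print(safe_cast("42", int))              # 42
print(safe_cast("", int))                # None
print(safe_cast("", int, default=0))     # 0
print(safe_cast("3.14", float))          # 3.14
print(safe_cast(None, float, 0.0))       # 0.0
print(safe_cast("3.14", int))            # None, int() rejects "3.14"
```

Catching only these two exception types is a deliberate choice in this sketch: other errors, which are more likely to signal a real bug, still propagate.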

Conclusion

In this article, we looked at 4 Python functions that can help us safely extract or transform values from dirty data.

  • coalesce allows us to pick the first present (not-None) value among its arguments.
  • safe_get provides a unified interface that allows us to safely extract data from dict or list data structures.
  • dig lets us reach into and extract values from potentially nested dict and list data structures.
  • safe_cast allows us to safely convert a value from one type to another.

Each of these functions is implemented in a way that minimizes the clutter of error handling logic by allowing us to pick a reasonable default return value (or None) in case we encounter a data extraction or transformation failure.

Additionally, they make it easy for us to extract data from the two most common data structures (dict and list), which are also what we get when we deserialize JSON, one of the most popular data interchange formats, for processing.

Thank you for reading!

