When you get a dataset to explore, there are several ways to do that in PySpark. You can run a describe or a summary, but if you want something a little more advanced, with a better view of what is actually in there, you might want to try data profiling.
Older documentation might point you to Pandas profiling, but this functionality is now part of the Python package ydata-profiling (which is imported as ydata_profiling).
I’ve been following this blog on starting with ydata-profiling:
https://www.databricks.com/blog/2023/04/03/pandas-profiling-now-supports-apache-spark.html
Getting ydata-profiling to work is not exactly a walk in the park. You’d think you could just feed it your messy dataset and it would show you what the data is like, but I encountered some problems:
- I got errors about missing Python packages in some situations.
- ydata-profiling doesn’t seem to like dataframes with only string columns.
Python packages
I’ve followed the examples from the above-mentioned Databricks blog. You can do so too, but consider using a newer version of the package: the blog uses version 4.0.0 and a lot has happened since then. I used 4.16.1.
On Databricks Community Edition I got this error:
AttributeError: module 'numba' has no attribute 'generated_jit'
This Stack Overflow answer tells you how to deal with that:
You probably need data types
I used a couple of favourite datasets, like the Olympic athletes dataset and the asteroid dataset of the Minor Planet Center, and also a CSV file I generated with lots of data quality issues.
This error occurred quite a lot when I tried to feed the raw data to ydata-profiling:
java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.
Which is followed by this error:
Caused by: java.lang.IllegalArgumentException: requirement failed: Vector should have dimension larger than zero.
I’ve found that some other people ran into this error, and it seems to be related to dataframes in which all columns are strings. Though when I cast one of the columns in the athlete dataframe to integer, the issue wasn’t solved immediately.
So you might need to do more casting to correct data types. Which is unfortunate for datasets that legitimately consist of string columns only.
ydata-profiling also seems to work better with bigger datasets. I tried it on small dataframes of about 10 rows with some all-NULL columns, and it doesn’t seem to like that either.
Running a ProfileReport
If you manage to solve that, you can run a profiling report with this code:
from ydata_profiling import ProfileReport

report = ProfileReport(
    athlete_df,
    title="Athletes",
    infer_dtypes=False,
    interactions=None,
    missing_diagrams=None,
    correlations={
        "auto": {"calculate": False},
        "pearson": {"calculate": True},
        "spearman": {"calculate": True},
    },
)
And then show the results with this:
report_html = report.to_html()
displayHTML(report_html)
You can view the results in a notebook on Microsoft Fabric or Databricks.
The profiling report
The report starts with some statistics:

The more interesting part is the Alerts tab. I was quite surprised to see that the athlete dataset has duplicate rows:

The alert had a link that didn’t seem to go anywhere in my Databricks Community Edition, but a table with the duplicate rows can be found at the end of the report:

It turns out the art competitions (yes, the Olympics once had those) seem to be the only rows affected.
The alerts also show possible correlations and zeros.

Then there are statistics for each column. Take, for example, the Sex column.

Okay. It’s 2025. I know there are more than just male and female genders. But I didn’t expect to find values like Jr.” or -Smith)” in this column. That points to data load issues we need to investigate. My guess: quotes inside strings?
For numeric columns we can get a lot of statistics. Take a look at the details for the Age column:

We can also look at extreme values. Wow, there was a 97-year-old athlete? I should query that data and see if it is actually correct.

[Update 24 April 2025 16:30] I’ve looked up the row for the athlete with the age of 97. The “athlete” was John Quincy Adams Ward from the USA. He apparently competed in the art competition of the Summer Olympics of Amsterdam 1928. Which is strange, because he died in 1910, at 80 years old. But his work was exhibited in 1928. He would have been 97 years old then, though that doesn’t make the data any less weird.
At the end of the report we find correlations. It turns out there are correlations between Height and Weight, which makes sense.

Conclusions
I can definitely see that it’s worthwhile to do data profiling with ydata-profiling, even though it might not work right away. Even for this dataset, which I thought I already knew quite well, I ran into surprising results.
For raw datasets, expect to have to do some work to get your profiling reports.