How Spark handles NULL and abnormal values
About NULL in Spark
Just like MySQL, Spark can have null values in its columns.
For instance, this DataFrame includes a null value:
scala> df2.show()
+--------+------+----+
| _c0| _c1| _c2|
+--------+------+----+
| Joel Z| Male| 23|
| Nancy H|Female| 19|
| Lily X|Female|null|
|Joanny C| Male| 30|
+--------+------+----+
scala> df2.printSchema()
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: integer (nullable = true)
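Judging by the default _c0/_c1/_c2 column names, df2 was probably read from a headerless CSV with inferSchema enabled. For a self-contained setup, an equivalent DataFrame could be built directly in the spark-shell (a sketch; the Option[Int] values are what make the integer column nullable):
scala> val df2 = Seq(
     |   ("Joel Z", "Male", Some(23)),
     |   ("Nancy H", "Female", Some(19)),
     |   ("Lily X", "Female", None),
     |   ("Joanny C", "Male", Some(30))
     | ).toDF("_c0", "_c1", "_c2")
df2: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 1 more field]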
To select the rows with null values, we can use neither "===" nor "=!=". The query below returns nothing:
scala> df2.select("*").where($"_c2" === null).show()
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
+---+---+---+
Instead we should use the isNull method, as follows.
scala> df2.select("*").where($"_c2".isNull).show()
+------+------+----+
| _c0| _c1| _c2|
+------+------+----+
|Lily X|Female|null|
+------+------+----+
This is the same as in MySQL, where we also can't compare a NULL value directly. For example, given the following table:
mysql> select * from fruits;
+----+--------+--------+
| id | fruit | number |
+----+--------+--------+
| 1 | plum | 3 |
| 2 | orange | 4 |
| 3 | apple | 2 |
| 4 | cherry | NULL |
+----+--------+--------+
4 rows in set (0.00 sec)
mysql> desc fruits;
+--------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| fruit | varchar(32) | YES | | NULL | |
| number | int(11) | YES | | NULL | |
+--------+-------------+------+-----+---------+----------------+
3 rows in set (0.01 sec)
This query returns nothing, because any comparison with NULL evaluates to NULL (unknown), which WHERE treats as false:
mysql> select * from fruits where number = null;
Empty set (0.00 sec)
Instead we should use the "is null" predicate:
mysql> select * from fruits where number is null;
+----+--------+--------+
| id | fruit | number |
+----+--------+--------+
| 4 | cherry | NULL |
+----+--------+--------+
1 row in set (0.00 sec)
So both Spark and MySQL handle NULL with the same mechanism: NULL can't be compared directly, but must be queried with dedicated predicates.
In Spark, the methods are "isNull" and "isNotNull". In MySQL, they are "is null" and "is not null".
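For completeness, here is the complementary Spark query with isNotNull; given the df2 above, the expected output is the three non-null rows:
scala> df2.select("*").where($"_c2".isNotNull).show()
+--------+------+---+
|     _c0|   _c1|_c2|
+--------+------+---+
|  Joel Z|  Male| 23|
| Nancy H|Female| 19|
|Joanny C|  Male| 30|
+--------+------+---+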
About abnormal values in Spark
From what I've seen, Spark can handle abnormal values automatically.
For instance, take the DataFrame below.
scala> df.printSchema()
root
|-- fruit: string (nullable = true)
|-- number: string (nullable = true)
scala> df.show()
+------+------+
| fruit|number|
+------+------+
| apple| 2|
|orange| 5|
|cherry| 7|
| plum| xyz|
+------+------+
This DataFrame has two problems:
- the number column has the wrong type: it should be numeric, but it is string (see the sketch below for a likely cause);
- the number column contains an abnormal value, "xyz".
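A guess at how the DataFrame got this shape: if df was loaded from CSV without inferSchema, every column arrives as string. A hypothetical reproduction (the file name is made up):
scala> val df = spark.read.option("header", "true").csv("fruits.csv")
df: org.apache.spark.sql.DataFrame = [fruit: string, number: string]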
Even so, Spark still aggregates the number column without complaint.
scala> df.agg(avg("number")).show()
+-----------------+
| avg(number)|
+-----------------+
|4.666666666666667|
+-----------------+
So I am guessing:
- Spark makes an implicit cast from string to numeric when aggregating;
- Spark ignores the abnormal values automatically when calculating, as the check below suggests.
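We can verify both guesses by applying the cast ourselves. Under Spark's default (non-ANSI) cast semantics, a string that can't be parsed as a number casts to null, and avg skips nulls; that is exactly why the result above is (2 + 5 + 7) / 3 = 4.666666666666667:
scala> df.select($"fruit", $"number".cast("double").as("number")).show()
+------+------+
| fruit|number|
+------+------+
| apple|   2.0|
|orange|   5.0|
|cherry|   7.0|
|  plum|  null|
+------+------+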
Though I was somewhat surprised by this, it is probably normal behavior for Spark. Since it can infer a schema (inferSchema) from an external data source, it should be well equipped to handle abnormal values.