Script vs Spark for a small dataset
A few days ago I ran into an issue using Perl to analyse a CSV file with 80+ million items. It blew past the memory on the 2 GB VPS I run it on.
Today I reran the analysis on another, smaller dataset with only 5 million items.
This time Perl does the job pretty well. The script is as follows:
use strict;
use warnings;

my %hash;   # per-job running totals and counts
my %stat;   # per-job average salary

# dataset:
# Arthur Y.B.,Male,1979-7-10,56607,[email protected],Loan Officer,23331.0
open my $fh, '<', 'person.csv' or die "cannot open person.csv: $!";
while (<$fh>) {
    chomp;
    my ($job, $salary) = (split /,/)[-2, -1];
    $hash{$job}{total} += $salary;
    $hash{$job}{count} += 1;
}
close $fh;

# compute the average salary for each job
for my $key (keys %hash) {
    $stat{$key} = $hash{$key}{total} / $hash{$key}{count};
}

# print the top 20 jobs by average salary
my $i = 0;
for (sort { $stat{$b} <=> $stat{$a} } keys %stat) {
    printf "%-40s%.10f\n", $_, $stat{$_};
    last if $i == 19;
    $i++;
}
The Perl script's output and timing:
$ time perl count.pl
Software Developer 17572.9448866016
Plumber 17572.5436757512
Recreation & Fitness Worker 17568.1629235022
Veterinary Technologist & Technician 17567.2669899038
Occupational Therapist 17562.4286936453
Cashier 17553.7658357742
Marketing Manager 17553.3646477380
Maid & Housekeeper 17551.8888711735
Executive Assistant 17550.3703990471
Diagnostic Medical Sonographer 17549.4909587512
Medical Assistant 17548.7211987571
Financial Analyst 17545.7428859941
Logistician 17543.4005038291
Financial Advisor 17542.6550936134
Landscaper & Groundskeeper 17541.8355272385
Telemarketer 17538.7797860791
Sales Manager 17534.8601334528
Construction Manager 17534.3827493797
Marriage & Family Therapist 17531.9513995878
Auto Mechanic 17527.9992342878
real 0m6.403s
user 0m6.304s
sys 0m0.088s
Apache Spark ran the same aggregation in about 7 seconds, so Perl is even a bit faster here. The Spark query:
scala> df.groupBy("job").agg(avg("salary").alias("avg_salary")).orderBy(desc("avg_salary")).show(false)
+------------------------------------+------------------+
|job |avg_salary |
+------------------------------------+------------------+
|Software Developer |17572.94488660164 |
|Plumber |17572.54367575122 |
|Recreation & Fitness Worker |17568.16292350223 |
|Veterinary Technologist & Technician|17567.266989903826|
|Occupational Therapist |17562.4286936453 |
|Cashier |17553.765835774204|
|Marketing Manager |17553.36464773796 |
|Maid & Housekeeper |17551.88887117347 |
|Executive Assistant |17550.370399047053|
|Diagnostic Medical Sonographer |17549.490958751172|
|Medical Assistant |17548.721198757143|
|Financial Analyst |17545.742885994143|
|Logistician |17543.4005038291 |
|Financial Advisor |17542.655093613437|
|Landscaper & Groundskeeper |17541.835527238483|
|Telemarketer |17538.779786079052|
|Sales Manager |17534.860133452843|
|Construction Manager |17534.382749379653|
|Marriage & Family Therapist |17531.951399587826|
|Auto Mechanic |17527.99923428779 |
+------------------------------------+------------------+
only showing top 20 rows
When memory is not a limit, a simple script can finish this kind of statistics job quickly.
Most of our online datasets, however, contain 100+ million items, which is not something a single-machine script is suited to analyse.
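For reference, the same streaming group-average idea can be sketched in Python. Like the Perl script, it keeps only one running total and count per job, so memory grows with the number of distinct jobs rather than the number of rows. This is a minimal sketch assuming the same CSV layout; the sample rows and names here are made up for illustration:

```python
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical sample rows in the same layout as person.csv:
# name,gender,birthday,id,email,job,salary
sample = StringIO(
    "Arthur Y.B.,Male,1979-7-10,56607,[email protected],Loan Officer,23331.0\n"
    "Jane D.,Female,1985-3-2,10211,[email protected],Loan Officer,20000.0\n"
    "Tom W.,Male,1990-1-1,33445,[email protected],Plumber,18000.0\n"
)

totals = defaultdict(lambda: [0.0, 0])  # job -> [salary total, row count]
for row in csv.reader(sample):          # streams one row at a time
    job, salary = row[-2], float(row[-1])
    totals[job][0] += salary
    totals[job][1] += 1

# average salary per job, printed top-20 in descending order
avg = {job: total / count for job, (total, count) in totals.items()}
for job, a in sorted(avg.items(), key=lambda kv: -kv[1])[:20]:
    print(f"{job:<40}{a:.10f}")
```

On a real file you would replace the `StringIO` sample with `open("person.csv")`; the constant-memory property is the same either way.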