Today I ran into a strange issue: a Perl script ran out of memory.

The result of running it:

 $ perl count.pl
 Out of memory!
 Killed

This is the script:

 use strict;
 use warnings;
 my %hash;
 my %stat;
 # dataset: userId, itemId, rate, time
 # AV056ETQ5RXLN,0000031887,1.0,1397692800
 open my $hd, '<', 'rate.csv' or die $!;
 while (<$hd>) {
     my ($item, $rate) = (split /,/)[1, 2];
     $hash{$item}{total} += $rate;   # running sum of rates per item
     $hash{$item}{count} += 1;       # number of rates per item
 }
 close $hd;
 # average the rates per item
 for my $key (keys %hash) {
     $stat{$key} = $hash{$key}{total} / $hash{$key}{count};
 }
 # print the top 100 items by average rate
 my $i = 0;
 for (sort { $stat{$b} <=> $stat{$a} } keys %stat) {
     print "$_: $stat{$_}\n";
     last if $i == 99;
     $i++;
 }

The purpose is to aggregate the scores per itemId, average them, and print the top 100 items after sorting by average.
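My first suspicion is the hash-of-hashes layout: Perl allocates a separate anonymous inner hash for every distinct itemId, each with its own bookkeeping overhead on top of the two values it actually stores. A minimal sketch of a flatter layout, assuming the same rate.csv format (the %total/%count names are mine, just for illustration):

 use strict;
 use warnings;
 # Two flat hashes keyed by itemId instead of one hash of hashes:
 # this avoids allocating an inner hash per distinct item.
 my (%total, %count);
 open my $hd, '<', 'rate.csv' or die $!;
 while (<$hd>) {
     my ($item, $rate) = (split /,/)[1, 2];
     $total{$item} += $rate;
     $count{$item}++;
 }
 close $hd;
 # averages computed the same way as before
 my %stat = map { $_ => $total{$_} / $count{$_} } keys %total;

Whether this alone fits in the available memory depends on how many distinct itemIds the 82 million lines contain, which I haven't measured yet.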

As you can see in the script, the dataset is a CSV file in which each line has four fields:

 userId, itemId, rate, time

The size of the dataset:

 $ wc -l rate.csv 
82677131 rate.csv

The memory on this machine is somewhat limited:

 $ free -m
               total        used        free      shared  buff/cache   available
 Mem:           1992        1385          76           0         530         466
 Swap:          1023         779         244
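With well under 1 GB actually available, it would help to know what the hash of hashes costs per distinct item. One rough way to measure it, assuming the CPAN module Devel::Size is installed, is to build a small synthetic sample shaped like the real data and extrapolate:

 use strict;
 use warnings;
 use Devel::Size qw(total_size);
 # Build 10,000 synthetic items and measure the total footprint
 # of the hash-of-hashes layout used in count.pl.
 my %hash;
 for my $i (1 .. 10_000) {
     my $item = sprintf '%010d', $i;   # fake 10-character itemId
     $hash{$item}{total} += 4.5;
     $hash{$item}{count} += 1;
 }
 printf "approx. bytes per item: %d\n", total_size(\%hash) / 10_000;

Multiplying the per-item figure by the number of distinct itemIds gives a rough lower bound on what the script needs; if that exceeds the available memory, no amount of tweaking this layout will save it.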

What confuses me is that Apache Spark can get this job done within this limited memory; it finished the statistics in about two minutes. Still, I wanted to give Perl a try, since it's not always convenient to spin up a Spark job.

The Spark implementation:

 scala> val schema="uid STRING,item STRING,rate FLOAT,time INT"
val schema: String = uid STRING,item STRING,rate FLOAT,time INT

scala> val df = spark.read.format("csv").schema(schema).load("skydrive/rate.csv")
val df: org.apache.spark.sql.DataFrame = [uid: string, item: string ... 2 more fields]

scala> df.groupBy("item").agg(avg("rate").alias("avg_rate")).orderBy(desc("avg_rate")).show()
+----------+--------+
|      item|avg_rate|
+----------+--------+
|0001061100|     5.0|
|0001543849|     5.0|
|0001061127|     5.0|
|0001019880|     5.0|
|0001062395|     5.0|
|0000143502|     5.0|
|000014357X|     5.0|
|0001527665|     5.0|
|000107461X|     5.0|
|0000191639|     5.0|
|0001127748|     5.0|
|0000791156|     5.0|
|0001203088|     5.0|
|0001053744|     5.0|
|0001360183|     5.0|
|0001042335|     5.0|
|0001374400|     5.0|
|0001046810|     5.0|
|0001380877|     5.0|
|0001050230|     5.0|
+----------+--------+
only showing top 20 rows
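Part of the reason Spark copes, as far as I understand it, is that it spills intermediate aggregation state to disk instead of keeping everything resident. The same trick can be approximated in Perl by letting an external sort do the disk work: pre-sort the file by itemId first (GNU sort spills to temporary files as needed, e.g. `sort -t, -k2,2 rate.csv > rate_sorted.csv`), then stream through it holding only one item's running totals at a time. A sketch under that assumption:

 use strict;
 use warnings;
 # Streaming aggregation over a file pre-sorted by itemId:
 # memory use is constant regardless of how many items there are.
 open my $hd, '<', 'rate_sorted.csv' or die $!;
 my ($cur, $sum, $n) = (undef, 0, 0);
 while (<$hd>) {
     my ($item, $rate) = (split /,/)[1, 2];
     if (defined $cur && $item ne $cur) {
         print "$cur: ", $sum / $n, "\n";   # emit the finished group
         ($sum, $n) = (0, 0);
     }
     $cur = $item;
     $sum += $rate;
     $n   += 1;
 }
 print "$cur: ", $sum / $n, "\n" if defined $cur;
 close $hd;

The output here isn't ranked; to get the top 100 by average, the result could be piped through something like `sort -t: -k2 -rn | head -n 100`.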

I will continue to investigate this issue.