Different Approaches to Filtering a CSV file
Using the best tool you have for the job
At work, I was tasked with extracting a column of data from a CSV file, like the one below:
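The real file isn't reproduced here; a minimal stand-in with the same shape might look like this, where the `StudentID` column and all of the row values are made up for illustration:

```csv
CourseID,StudentID
1,1001
2,1002
1,1003
3,1004
2,1005
```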
The purpose was to get the number of unique course IDs being offered. For the example above, there are three (1, 2 and 3).
I used a one-line command to accomplish the task:
% cut -f 1 -d "," enrolments.csv | tail -n +2 | sort | uniq | wc -l
The output is 3.
Script Design
- Remove the header line (`CourseID …`)
  This can be achieved using unix's `tail -n +2 <file>`.
- Extract the first column only
  Using the unix command `cut -f 1 -d ","`. `-f` is the column number to keep (1 => first column), `-d` specifies the delimiter (in a CSV, a comma).
- Remove the duplicate rows
  In unix, use `sort` followed by `uniq`. `uniq` only removes neighbouring duplicates, so you must `sort` first.
- Count the result
  In unix, use `wc -l`. `-l` means count the number of lines returned.

The unix pipe (`|`) passes the result of each command to the next.
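To see how the pipeline builds up, here is each stage run against the sample file from the beginning of the post (the outputs in the comments assume that illustrative data):

```sh
# Stage 1: keep only the first column (the header survives for now)
cut -f 1 -d "," enrolments.csv                                      # CourseID 1 2 1 3 2
# Stage 2: drop the header line
cut -f 1 -d "," enrolments.csv | tail -n +2                         # 1 2 1 3 2
# Stage 3: sort so duplicates become neighbours, then deduplicate
cut -f 1 -d "," enrolments.csv | tail -n +2 | sort | uniq           # 1 2 3
# Stage 4: count the remaining lines
cut -f 1 -d "," enrolments.csv | tail -n +2 | sort | uniq | wc -l   # 3
```

As a side note, `sort -u` does the work of `sort | uniq` in a single command; I kept the two separate here to mirror the steps above.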
Alternative Approaches
A junior developer on our team had the same task, and they wrote a Python script to remove the duplicates and count the course IDs. This was not a bad solution; we went to the same university, and I fully understand why they took this approach. A senior engineer on our team who saw my approach also recommended that I try out xcut (an enhanced `cut`)!
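The junior developer's actual script isn't shown here, but a Python version of the task might look roughly like this (the file and column names follow the example above; the rest is just one plausible way to write it):

```python
import csv

# Collect the distinct values of the CourseID column.
unique_courses = set()
with open("enrolments.csv", newline="") as f:
    for row in csv.DictReader(f):
        unique_courses.add(row["CourseID"])

# Print the number of unique course IDs (3 for the sample data above).
print(len(unique_courses))
```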
My father told me a story about a .NET project he worked on. The tech lead, a senior .NET contractor, proposed a solution for a task that could have been completed with a simple one-line shell command. Instead, the suggestion was to develop a C# application that would allow users to select an input CSV file, process it, and save the output to a destination file!
This story tells us that software engineers, like professionals in many other fields, should have a variety of tools in their toolbox, and use the best one for the job.