Case Study: Automation Script to Extract the Top 10 Authors featured in the Software Testing Newsletters

To verify my father’s claim that he was the “most featured author in the leading software testing newsletters”.

Courtney Zhan
8 min read · Feb 11, 2023


In a recent article, my father guessed that he was “probably the most featured author in the leading software testing newsletters”. In this article, I will write an automation script to verify this claim by extracting author names from past issues (since 2021) of Software Testing Weekly and Coding Jag, both widely regarded as among the best software testing newsletters.

Table of Contents
· Analyse
1. To Extract the Author Names in the description of each article.
2. There can be more than one author within an article.
3. Narrow down the sections
4. Remove article links.
5. Filter out by exclusion words
· Execution
Counting
Put it all together to analyse all 98 issues over the past 2 years
Charting
· My father’s featured count in Coding Jag
· Summary
· Full Test Script
· Zhimin’s Notes

Analyse

I start with one Software Testing Weekly issue (#153, the latest at the time).

1. To Extract the Author Names in the description of each article.

From Software Testing Weekly issue #153 (2023-01-29)

The author’s name is linked underneath the article title.

There are no identifiable attributes, e.g. <a class='author'>, for author links. This means the locating strategy is to extract the links inside paragraphs with an XPath such as //p/a.

2. There can be more than one author within an article.

For the above example,

There are three links here, two are for authors (Zhimin Zhan and Malith Senadheera), and one is for another related article.

3. Narrow down the sections

The whole page structure is as below.

I started with this,

driver.find_elements(:xpath, "//div[@class='issue__body']//p/a")

to get all links under articles.

The results have too much noise, such as the sponsored and general links. I need to narrow it down to the relevant sections, defined in an array (in Ruby).

stw_categories = %w(cc-news cc-automation cc-toools cc-tools cc-books cc-videos)

Use the script below to get the links in the specified sections, then combine them.

stw_categories.each do |category|
  section_links = driver.find_elements(:xpath,
    "//section[@class='category #{category}']//div/p/a")
  # ...
end
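The XPath above compares the whole class attribute, so it only matches when a section's class is exactly 'category cc-news' and so on. A minimal sketch of building that expression per category, next to a more tolerant variant, is below. The helper names are made up for illustration, and the tolerant variant only matters if the sections carry extra classes, which is an assumption and not something confirmed on the site.

```ruby
# Hypothetical helpers (not part of the original script) that build the
# XPath for the author links of one category section.

# Same expression as in the script: the class attribute must match exactly.
def exact_section_xpath(category)
  "//section[@class='category #{category}']//div/p/a"
end

# Tolerant variant: matches when the category is one of several class tokens.
# This is the usual XPath 1.0 idiom for "element has this class".
def tolerant_section_xpath(category)
  "//section[contains(concat(' ', normalize-space(@class), ' '), ' #{category} ')]//div/p/a"
end

STW_CATEGORIES = %w[cc-news cc-automation cc-tools cc-books cc-videos]

STW_CATEGORIES.each do |category|
  puts exact_section_xpath(category)
end
puts tolerant_section_xpath("cc-news")
```

Either string can be passed to driver.find_elements(:xpath, ...) in place of the inline expression.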

4. Remove article links.

I am only interested in author names, but the paragraphs may also contain links to other articles. To filter those out, I made a crude assumption that an author’s name is at most three words long. Everything longer gets filtered out.

next if the_link_text.split.size > 3

The split method returns an array of words from a string.
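As a quick illustration of the word-count rule, here is a small sketch that can be run without a browser. The helper name short_enough? and the sample texts are made up for this example; only the names come from the issue.

```ruby
# Sample link texts. The article title is invented for illustration.
SAMPLE_LINK_TEXTS = [
  "Zhimin Zhan",
  "John Ferguson Smart",
  "  Malith   Senadheera ",
  "Why Test Automation Fails so Often",
  "Test Model"
]

# Hypothetical helper: true when the text has at most `max_words`
# whitespace-separated words. String#split with no arguments ignores
# leading, trailing and repeated whitespace.
def short_enough?(text, max_words = 3)
  text.split.size <= max_words
end

SAMPLE_LINK_TEXTS.each do |text|
  puts "#{text.inspect}: #{text.split.size} words, kept=#{short_enough?(text)}"
end
```

Note that a short non-author link such as "Test Model" still passes this check, and an author with a four-word name would be dropped.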

5. Filter out by exclusion words

There may still be short article names or other links. So I define an exclusion list.

exclude_words = ["this Reddit thread", "Test Model", "Architect of Quality"]

If the link text contains any of them, exclude it.

next if exclude_words.any?{|x| the_link_text.include?(x) }
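The two filters can also be combined into one predicate, which makes them easy to try on plain strings. This is only a sketch: author_name? is a hypothetical helper, the sample texts are partly invented, and the blank-text check is an addition that is not in the original script.

```ruby
EXCLUDE_WORDS = ["this Reddit thread", "Test Model", "Architect of Quality"]

# Hypothetical helper: true when the link text looks like an author name.
def author_name?(text, exclude_words = EXCLUDE_WORDS)
  return false if text.strip.empty?     # addition: skip links with no text
  return false if text.split.size > 3   # longer texts are assumed to be titles
  # String#include? is case-sensitive, so "test model" would not be excluded.
  exclude_words.none? { |phrase| text.include?(phrase) }
end

link_texts = ["Zhimin Zhan", "Test Model", "this Reddit thread",
              "A Very Long Article Title", ""]
puts link_texts.select { |text| author_name?(text) }.inspect
```

With these samples, only "Zhimin Zhan" survives.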

Execution

Here is the test script for analysing one issue (#153).

it "Extract authors in Software Testing Weekly #153" do
  driver.get("https://softwaretestingweekly.com/issues/153")
  stw_categories = %w(cc-news cc-automation cc-toools cc-tools cc-books cc-videos)
  exclude_words = ["this Reddit thread", "Test Model", "Architect of Quality"]
  category_links = []
  stw_categories.each do |category|
    section_links = driver.find_elements(:xpath, "//section[@class='category #{category}']//div/p/a")
    section_links.each do |one_link|
      the_link_text = one_link.text
      next if the_link_text.split.size > 3 # too long to be an author name
      next if exclude_words.any? { |x| the_link_text.include?(x) }
      category_links << one_link
    end
  end
  author_names = category_links.collect { |elem| elem.text }
  puts author_names # one name per line
  puts "\n" + author_names.size.to_s + " in total"
end

Of course, I did not get the above in one go. I worked it out step by step in TestWise, using its wonderful “debugging mode”: it attaches test execution to the existing browser, so there is no need to restart from the beginning to try out a new test step. That is a huge time saver and keeps the momentum going. So it did not take long, maybe 15 minutes (including analysis time), to get it done.

Running one test step in TestWise debugging mode.

The output:

Antoine Craske 
Ricardo Bedin
Alan Richardson
Daniel Lehner
Maciej Rojek
Martin Ivison
Ioan Solderea
John Ferguson Smart
Elizabeth Zagroba
Jeff Cechinel
Paul de Witt
Criss Chan
Zhimin Zhan
Lutfi Fitroh Hadi
Dan Neciu
Debojyoti Chatterjee
Zhimin Zhan
Malith Senadheera
Nikola Dimic
Mike Harris
Jennifer Columbe
John Miller

22 author names in Issue #153 (including one repeat). Compared with the issue, that looks about right. Please note that 100% accuracy is not what I am aiming for, as I am only interested in the top authors.

Counting

In issue 153, my father’s name, “Zhimin Zhan”, appeared twice. We need to count authors by their number of occurrences. This is very easy to do in Ruby!

puts author_names.tally

The output:

{"Antoine Craske"=>1, "Ricardo Bedin"=>1, "Alan Richardson"=>1, 
"Daniel Lehner"=>1, "Maciej Rojek"=>1, "Martin Ivison"=>1,
"Ioan Solderea"=>1, "John Ferguson Smart"=>1, "Elizabeth Zagroba"=>1,
"Jeff Cechinel"=>1, "Paul de Witt"=>1, "Criss Chan"=>1,
"Zhimin Zhan"=>2, "Lutfi Fitroh Hadi"=>1, "Dan Neciu"=>1,
"Debojyoti Chatterjee"=>1, "Malith Senadheera"=>1, "Nikola Dimic"=>1,
"Mike Harris"=>1, "Jennifer Columbe"=>1, "John Miller"=>1}

To sort by occurrences.

sorted = author_names.tally.sort_by(&:last)

The output:

[["John Miller", 1],
["Ricardo Bedin", 1],
["Alan Richardson", 1],
["Daniel Lehner", 1],
["Maciej Rojek", 1],
["Martin Ivison", 1],
["Ioan Solderea", 1],
["John Ferguson Smart", 1],
["Elizabeth Zagroba", 1],
["Jeff Cechinel", 1],
["Paul de Witt", 1],
["Criss Chan", 1],
["Antoine Craske", 1],
["Lutfi Fitroh Hadi", 1],
["Dan Neciu", 1],
["Debojyoti Chatterjee", 1],
["Malith Senadheera", 1],
["Nikola Dimic", 1],
["Mike Harris", 1],
["Jennifer Columbe", 1],
["Zhimin Zhan", 2]]

To sort from high to low, reverse the array in place with sorted.reverse! to get:

[["Zhimin Zhan", 2],
["Jennifer Columbe", 1],
...
]

To get the top 10.

top_10 = sorted[..9]
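The counting steps above can be folded into one helper. The sketch below is an alternative to sort_by(&:last) plus reverse!, not the script's own code: top_authors is a hypothetical name, and it sorts by descending count with the name as a tie-breaker so that authors with equal counts come out in a repeatable order (sort_by alone makes no promise about the order of ties). It assumes Ruby 2.7 or later for Array#tally.

```ruby
# Hypothetical helper: the top `n` [name, count] pairs, highest count first,
# ties broken alphabetically by name.
def top_authors(names, n = 10)
  names.tally.sort_by { |name, count| [-count, name] }.first(n)
end

sample = ["Zhimin Zhan", "Mike Harris", "Zhimin Zhan", "Alan Richardson"]
p top_authors(sample, 2)
# => [["Zhimin Zhan", 2], ["Alan Richardson", 1]]
```

first(n) also returns fewer than n pairs without complaint when the list is short.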

Put it all together to analyse all 98 issues over the past 2 years

My father started blogging on January 27, 2021. The issue around that time is #56. So I added a loop to analyse these 98 issues (#56 to #153).

author_names = []
(56..153).each do |issue_no|
  puts "Issue: #{issue_no}"
  driver.get("https://softwaretestingweekly.com/issues/#{issue_no}")

  # ... see above to extract one
  # ...
  author_names << the_link_text

  sleep 1 # don't hit the server too hard
end

Note: I added a sleep of 1 second in between loading each issue to prevent spamming the server too much.
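To show the shape of the loop without a browser, here is a sketch in which the per-issue extraction is passed in as a block. collect_author_names and FAKE_ISSUES are hypothetical and the data is made up; in the real script the block would be the Selenium extraction shown earlier.

```ruby
# Hypothetical helper: calls the block once per issue number, collects the
# author names it returns, and pauses between issues.
def collect_author_names(issue_numbers, delay: 1)
  author_names = []
  issue_numbers.each do |issue_no|
    author_names.concat(yield(issue_no))
    sleep delay if delay > 0 # don't hit the server too hard
  end
  author_names
end

# Stand-in for the Selenium extraction; this data is invented.
FAKE_ISSUES = {
  152 => ["Mike Harris"],
  153 => ["Zhimin Zhan", "Zhimin Zhan", "Malith Senadheera"]
}

names = collect_author_names(152..153, delay: 0) do |issue_no|
  FAKE_ISSUES.fetch(issue_no, [])
end
p names.tally

puts (56..153).size # number of issues in the real run
```

The last line confirms the issue count used in the heading: #56 to #153 inclusive is 98 issues.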

The result:

[
["Dennis Martinez", 40],
["Zhimin Zhan", 37],
["Antoine Craske", 37],
["Maaret Pyh\u00E4j\u00E4rvi", 30],
["Gleb Bahmutov", 28],
["Pramod Dutta", 24],
["Michael Bolton", 18],
["Mike Harris", 18],
["Callum Akehurst-Ryan", 16],
["Gil Zilberfeld", 16]
]

So my father is second (tied with Antoine Craske on 37), not first.

I quickly checked a few issues and found that a high percentage of the articles by “Dennis Martinez” and “Antoine Craske” are under the “News” category, which probably doesn’t fit the scope of the claim. If I exclude that category, i.e. `stw_categories = %w(cc-automation cc-toools cc-tools cc-books cc-videos)`, the result would be:

[
["Zhimin Zhan", 32],
["Gleb Bahmutov", 28],
["Dennis Martinez", 27],
["Pramod Dutta", 24],
["Gil Zilberfeld", 14],
["Oleksandr Romanov", 12],
["Filip Hric", 12],
["Paul Grizzaffi", 11],
["Marie Drake", 11],
["NaveenKumar Namachivayam", 11]
]

On this measure, my father is at the top. Still, to be completely neutral, I will go with the first result (including News, where my father ranked No. 2) for Software Testing Weekly.

Charting

My father’s featured count in Coding Jag

I also tried to create an automation script to do the same for another leading software testing newsletter: Coding Jag. However, author names are not shown in Coding Jag.

A sample article in Coding Jag.

So, extracting all authors for comparison is not possible, but I can count the total number of my father’s articles featured there, from Issue 22 to 125 (the same period).

The script below is the main logic for counting unique article links containing `zhiminzhan`. The full script is listed in a later section.

links = driver.find_elements(:tag_name, "a")
link_texts = links.collect { |x| x["href"] }
zhimin_links = link_texts.compact.select { |y| y.include?("zhiminzhan") }.uniq
zhimin_total_count += zhimin_links.count

The results:

Total number of articles by Zhimin Zhan on Coding Jag: 60
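The selection step can be tried on plain arrays of hrefs. In this sketch matching_links is a hypothetical helper and the URLs are invented; nil stands for a link element without an href attribute, which is why compact is needed.

```ruby
# Hypothetical helper mirroring the selection step for one issue:
# drop nil hrefs, keep those containing the needle, remove duplicates.
def matching_links(hrefs, needle = "zhiminzhan")
  hrefs.compact.select { |href| href.include?(needle) }.uniq
end

# Made-up hrefs for two issues.
issue_a = ["https://example.com/zhiminzhan/post-1", nil,
           "https://example.com/zhiminzhan/post-1", "https://example.com/other"]
issue_b = ["https://example.com/zhiminzhan/post-1",
           "https://example.com/zhiminzhan/post-2"]

total = [issue_a, issue_b].sum { |hrefs| matching_links(hrefs).count }
puts total
# prints 3
```

Duplicates are removed within one issue only, so an article linked from two different issues (post-1 here) is counted once per issue.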

Summary

Coding Jag featured more of my father’s articles than Software Testing Weekly did: 60 versus 37, or 62% more over the same period.
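For reference, the 62% figure follows from the two counts reported above (60 links in Coding Jag, 37 in Software Testing Weekly). A tiny sketch of the arithmetic, with percent_more as a made-up helper name:

```ruby
# Hypothetical helper: by what percentage `larger` exceeds `smaller`,
# rounded to the nearest whole number.
def percent_more(larger, smaller)
  ((larger - smaller) * 100.0 / smaller).round
end

puts percent_more(60, 37)
# prints 62
```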

With all the info above, my father’s claim is mostly correct.

Full Test Script

1. Software Testing Weekly
require 'rspec'
require 'selenium-webdriver'

describe "Analyse Popular Authors In Software Testing Newsletters" do
  before(:all) do
    @driver = Selenium::WebDriver.for(:chrome)
    driver.manage.window.resize_to(1280, 720)
  end

  after(:all) do
    driver.quit
  end

  def driver
    @driver
  end

  it "Extract authors in Software Testing Weekly #56 to #153" do
    # section class names to scan for author links
    stw_categories = %w(cc-news cc-automation cc-toools cc-tools cc-books cc-videos)
    # link texts containing any of these are not author names
    exclude_words = ["this Reddit thread", "Test Model", "Architect of Quality", "k6", "Cypress", "Playwright", "Postman"]
    author_names = []

    (56..153).each do |issue_no|
      puts "Issue: #{issue_no}"
      driver.get("https://softwaretestingweekly.com/issues/#{issue_no}")
      sleep 0.5
      stw_categories.each do |category|
        section_links = driver.find_elements(:xpath, "//section[@class='category #{category}']//div/p/a")
        section_links.each do |one_link|
          the_link_text = one_link.text
          next if the_link_text.split.size > 3 # too long to be an author name
          next if exclude_words.any? { |x| the_link_text.include?(x) }
          author_names << the_link_text
        end
      end
      sleep 1 # don't hit the server too hard
    end

    puts "\n" + author_names.size.to_s + " in total"
    metrics = author_names.tally
    sorted = metrics.sort_by { |_key, value| value }
    sorted.reverse! # => the most popular first
    top_10 = sorted[..9]

    # save the full sorted list (macOS only); File.write closes the file
    File.write("/tmp/stw_authors.txt", sorted.inspect + "\n") if RUBY_PLATFORM =~ /darwin/
    puts top_10.inspect
  end
end

To run it (after Ruby and the libraries are installed with gem install rspec selenium-webdriver), run the above script from the command line.

> rspec analyse_stw_top_authors_spec.rb

2. Coding Jag

it "Coding Jag" do
  first_issue = 22
  latest_issue = 125

  zhimin_total_count = 0
  (first_issue..latest_issue).each do |issue_no|
    driver.get("https://www.lambdatest.com/newsletter/editions/issue#{issue_no}")
    sleep 0.5
    links = driver.find_elements(:tag_name, "a")
    link_texts = links.collect { |x| x["href"] }
    zhimin_links = link_texts.compact.select { |y| y.include?("zhiminzhan") }.uniq
    zhimin_total_count += zhimin_links.count
    sleep 1
  end
  puts zhimin_total_count
end
