Case Study: Add Business Intelligence to Extracted Medium Article Titles

Combining the power and simplicity of both Selenium and Ruby

Courtney Zhan
4 min readAug 23, 2024

From the previous case study, I created an automation utility (using Selenium WebDriver in TestWise) to extract Medium published article titles.

We can take further steps to improve this case study by incorporating business intelligence for draft articles. The tasks will include:

  • Sorting the article titles alphabetically
  • Extracting the word count and last edited date
  • Extracting the links (URLs)
  • Saving all the data in a spreadsheet format.

The aim is to make it easier to view drafted articles in a spreadsheet, like macOS’ Numbers or Microsoft Excel.

Test Design

The first two steps are covered in the previous case story.

  1. Login to Medium account, pass authentication manually
  2. Keep scrolling down until reaching the end.
  3. Extract article title, last edited time and word account for each draft story
    Easy with Selenium WebDriver, using XPath and chained locators.
  4. Extract the data for all articles.
  5. Process data
    Basic programming.
  6. Save to a CSV file
    Easy with Ruby.

Steps

1. Preparation

Start a Chrome browser session from executing a test case in TestWise;
Log in (manually)
Keep scrolling down to the end (no more stories)

Then, run the following test steps in TestWise debugging mode.

2. Locating a User Story Title Element

This time (compared to the previous case study), I need to extract more data. First, starting with the story title element, the clearly visible one.

That is the <a> link.

story_title_elem = driver.find_element(:xpath, "//a[@data-testid='postTitle']")

We can extract the story title and its link (URL)

title = story_title_elem.text
href = story_title_elem["href"]

3. Locating a User Story Element

We need more info, such as last edited time and word count (indicated below).

We need to trace back, from the story title element, to the whole story element. It is quite easy to do, using Selenium’s chained locator, after a few tries. (this shows the great benefits of TestWise’s debugging mode)

  story_elem = story_title_elem.find_element(:xpath, "../../..")

4. Extract the story’s extra data, using regular expression

Using element.text to get all visible text of a story element, which includes title, last edited time, word count and others.

We extract the last_edited and word_count out, using regular expression.

story_elem = story_title_elem.find_element(:xpath, "../../..")
story_section_text = story_elem.text.gsub("\n", " ")
story_section_text = story_section_text.gsub("\r\n", " ")
if story_section_text =~ /Last\sedited\s(.+)\sago.+\((\d+)\swords\)/
last_edited = $1
word_count = $2
end

5. Normalize the data

Medium show time like these: “over 3 years”, “about 1 months ago”, “2 hours ago”. We need to remove over , about first.

 last_edited = last_edited.gsub("about ", "").
gsub("over ", "").gsub("almost ", "").
gsub(" ", ".") + ".ago"

How about get the actual date? Using the ActiveSupport library, which can convert '1.month.ago' to 2024-07-15 (taking today as 2023–08–15 as example).

 last_edited_time = eval(last_edited)
last_edited_time = last_edited_time.strftime("%Y-%m-%d")

6. Do for all stories

The above is for just one story. We need to collect data for all stories. Use Selenium find_elements , and process one by one.

 story_title_elem_list = driver.find_elements(:xpath, "//a[@data-testid='postTitle']")

story_data = []
story_title_elem_list.each do |story_title_elem|
# ...
story_elem = story_title_elem.find_element(:xpath, "../../..")
# ...

story_data << [title, last_edited_time, word_count, href]
end

I store the data into an array. The below statement sorts all story data by title.

  story_data.sort! { |a, b| a[0] <=> b[0] }

7. Write to a CSV File

Ruby comes with built-in CSV library.

    csv_file = File.join(File.dirname(__FILE__), "..", "draft_stories.csv")
CSV.open(csv_file, "w") do |csv|
csv << ["Title", "Last Edited", "Word Count", "Link"]
story_data.each do |sd|
csv << sd
end
end

Output

Below is the outputed CSV file opened in macOS’ Numbers.

The same file is opened fine in Excel.

Full Test Script

load File.dirname(__FILE__) + "/../test_helper.rb"
require "active_support/all"
require "csv"

describe "Case Study 2: Extract All Medium Draft Stories to a Spreadsheet" do
include TestHelper

before(:all) do
# browser_type, browser_options, site_url are defined in test_helper.rb
@driver = $driver = Selenium::WebDriver.for(browser_type, browser_options)
driver.manage().window().resize_to(1280, 800)
driver.get(site_url)
end

after(:all) do
driver.quit unless debugging?
end

it "Sort and save to spreadsheet" do
driver.find_element(:link_text, "Sign in").click

# MANUAL LOGIN, then run the below in TestWise Debugging mode
break

visit("/me/stories/drafts")

# the statement below is testing, optional
story_title = driver.find_element(:xpath, "//a[@data-testid='postTitle']").text
story_title_elem = driver.find_element(:xpath, "//a[@data-testid='postTitle']")

story_title_elem_list = driver.find_elements(:xpath, "//a[@data-testid='postTitle']")
story_data = []
story_title_elem_list.each do |story_title_elem|
href = story_title_elem["href"]
href.gsub!("?source=your_stories_page-------------------------------------", "")
story_elem = story_title_elem.find_element(:xpath, "../../..")
story_section_text = story_elem.text.gsub("\n", " ")
story_section_text = story_section_text.gsub("\r\n", " ")

fio.puts href

title = story_title_elem.text
if story_section_text =~ /Last\sedited\s(.+)\sago.+\((\d+)\swords\)/
last_edited = $1
word_count = $2

last_edited = last_edited.gsub("about ", "").gsub("over ", "").gsub("almost ", "").gsub(" ", ".") + ".ago"
fio.puts("|#{last_edited}|")
last_edited_time = eval(last_edited)
last_edited_time = last_edited_time.strftime("%Y-%m-%d")
end

story_data << [title, last_edited_time, word_count, href]
end

story_data.sort! { |a, b| a[0] <=> b[0] }

csv_file = File.join(File.dirname(__FILE__), "..", "draft_stories.csv")
CSV.open(csv_file, "w") do |csv|
csv << ["Title", "Last Edited", "Word Count", "Link"]
story_data.each do |sd|
csv << sd
end
end

end
end

--

--