Case Study: Add Business Intelligence to Extracted Medium Article Titles
Combining the power and simplicity of both Selenium and Ruby
In the previous case study, I created an automation utility (using Selenium WebDriver in TestWise) to extract the titles of my published Medium articles.
We can take this case study further by incorporating business intelligence for draft articles. The tasks include:
- Sorting the article titles alphabetically
- Extracting the word count and last edited date
- Extracting the links (URLs)
- Saving all the data in a spreadsheet format.
The aim is to make it easier to view draft articles in a spreadsheet app such as macOS Numbers or Microsoft Excel.
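The end result will be one row per draft story, along these lines (the sample values below are made up purely for illustration):
Title,Last Edited,Word Count,Link
"An Example Draft Title",2023-06-15,850,https://medium.com/p/...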
Test Design
The first two steps were covered in the previous case study.
- Log in to the Medium account, passing authentication manually
- Keep scrolling down until reaching the end
- Extract the article title, last edited time and word count for each draft story (easy with Selenium WebDriver, using XPath and chained locators)
- Extract the data for all articles
- Process the data (basic programming)
- Save to a CSV file (easy with Ruby)
Steps
1. Preparation
Start a Chrome browser session by executing a test case in TestWise
Log in (manually)
Keep scrolling down to the end (no more stories); an optional sketch to automate this step follows below
Then, run the following test steps in TestWise debugging mode.
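The scrolling in the step above is done manually here. If you prefer to automate it, below is a minimal sketch using execute_script (my own addition; the 2-second pause and the page-height check are assumptions, not from the original script).
# keep scrolling until the page height stops growing, i.e. no more stories load
last_height = driver.execute_script("return document.body.scrollHeight")
loop do
  driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
  sleep 2 # give Medium time to load the next batch of drafts
  new_height = driver.execute_script("return document.body.scrollHeight")
  break if new_height == last_height
  last_height = new_height
end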
2. Locating a User Story Title Element
This time (compared to the previous case study), I need to extract more data. Start with the story title element, the clearly visible one. That is the <a> link.
story_title_elem = driver.find_element(:xpath, "//a[@data-testid='postTitle']")
We can extract the story title and its link (URL):
title = story_title_elem.text
href = story_title_elem["href"]
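Note that the extracted href carries a long tracking query string added by Medium; the full script at the end strips it so the CSV contains a clean link:
href = href.gsub("?source=your_stories_page-------------------------------------", "")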
3. Locating a User Story Element
We need more info, such as the last edited time and word count.
We need to trace back from the story title element to the whole story element. It is quite easy to do with Selenium's chained locators, after a few tries (this shows a great benefit of TestWise's debugging mode).
story_elem = story_title_elem.find_element(:xpath, "../../..")
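In TestWise's debugging mode, a quick way to confirm that the traced element really is the whole story block is to print its text:
puts story_elem.text
# expect to see the title plus the "Last edited ... ago" and "(... words)" text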
4. Extract the story’s extra data using a regular expression
Use element.text to get all the visible text of a story element, which includes the title, last edited time, word count and more.
We then extract last_edited and word_count using a regular expression.
story_elem = story_title_elem.find_element(:xpath, "../../..")
story_section_text = story_elem.text.gsub(/\r?\n/, " ")
if story_section_text =~ /Last\sedited\s(.+)\sago.+\((\d+)\swords\)/
  last_edited = $1
  word_count = $2
end
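To check the regular expression in isolation, you can run it against a made-up string of the same shape (the sample text is an assumption, not real data):
sample = "A Sample Draft Title Last edited about 2 months ago (850 words)"
if sample =~ /Last\sedited\s(.+)\sago.+\((\d+)\swords\)/
  puts $1 # => "about 2 months"
  puts $2 # => "850"
end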
5. Normalize the data
Medium shows the time in forms like “over 3 years”, “about 1 month ago”, “2 hours ago”. We need to remove the “over”, “about” and “almost” prefixes first.
last_edited = last_edited.gsub("about ", "").
  gsub("over ", "").gsub("almost ", "").
  gsub(" ", ".") + ".ago"
How about getting the actual date? Use the ActiveSupport library, which can evaluate '1.month.ago' to 2023-07-15 (taking today as 2023-08-15 as an example).
last_edited_time = eval(last_edited)
last_edited_time = last_edited_time.strftime("%Y-%m-%d")
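The eval call works because ActiveSupport adds methods like months to numbers and ago to the resulting duration. If you would rather not use eval, here is a sketch of an alternative (my own variation, not part of the original script):
# parse "2.months.ago" without eval: split out the number and the unit,
# then call the ActiveSupport duration method directly
number, unit = last_edited.sub(".ago", "").split(".")
last_edited_time = Integer(number).public_send(unit).ago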
6. Do it for all stories
The above is for just one story. We need to collect data for all the stories. Use Selenium’s find_elements, and process the stories one by one.
story_title_elem_list = driver.find_elements(:xpath, "//a[@data-testid='postTitle']")
story_data = []
story_title_elem_list.each do |story_title_elem|
  # ...
  story_elem = story_title_elem.find_element(:xpath, "../../..")
  # ...
  story_data << [title, last_edited_time, word_count, href]
end
I store the data in an array. The statement below sorts all the story data by title.
story_data.sort! { |a, b| a[0] <=> b[0] }
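If the titles mix upper and lower case, a case-insensitive variation (my tweak, not in the original script) keeps the ordering intuitive:
story_data.sort_by! { |sd| sd[0].to_s.downcase }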
7. Write to a CSV File
Ruby comes with a built-in CSV library (require "csv").
csv_file = File.join(File.dirname(__FILE__), "..", "draft_stories.csv")
CSV.open(csv_file, "w") do |csv|
  csv << ["Title", "Last Edited", "Word Count", "Link"]
  story_data.each do |sd|
    csv << sd
  end
end
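As an optional sanity check (not part of the original script), the file can be read back with the same CSV library:
CSV.foreach(csv_file, headers: true) do |row|
  puts "#{row['Title']}: #{row['Word Count']} words, last edited #{row['Last Edited']}"
end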
Output
Below is the output CSV file opened in macOS Numbers.
The same file also opens fine in Excel.
Full Test Script
load File.dirname(__FILE__) + "/../test_helper.rb"
require "active_support/all"
require "csv"
describe "Case Study 2: Extract All Medium Draft Stories to a Spreadsheet" do
include TestHelper
before(:all) do
# browser_type, browser_options, site_url are defined in test_helper.rb
@driver = $driver = Selenium::WebDriver.for(browser_type, browser_options)
driver.manage().window().resize_to(1280, 800)
driver.get(site_url)
end
after(:all) do
driver.quit unless debugging?
end
it "Sort and save to spreadsheet" do
driver.find_element(:link_text, "Sign in").click
# MANUAL LOGIN, then run the below in TestWise Debugging mode
break
visit("/me/stories/drafts")
# the statement below is testing, optional
story_title = driver.find_element(:xpath, "//a[@data-testid='postTitle']").text
story_title_elem = driver.find_element(:xpath, "//a[@data-testid='postTitle']")
story_title_elem_list = driver.find_elements(:xpath, "//a[@data-testid='postTitle']")
story_data = []
story_title_elem_list.each do |story_title_elem|
href = story_title_elem["href"]
href.gsub!("?source=your_stories_page-------------------------------------", "")
story_elem = story_title_elem.find_element(:xpath, "../../..")
story_section_text = story_elem.text.gsub("\n", " ")
story_section_text = story_section_text.gsub("\r\n", " ")
fio.puts href
title = story_title_elem.text
if story_section_text =~ /Last\sedited\s(.+)\sago.+\((\d+)\swords\)/
last_edited = $1
word_count = $2
last_edited = last_edited.gsub("about ", "").gsub("over ", "").gsub("almost ", "").gsub(" ", ".") + ".ago"
fio.puts("|#{last_edited}|")
last_edited_time = eval(last_edited)
last_edited_time = last_edited_time.strftime("%Y-%m-%d")
end
story_data << [title, last_edited_time, word_count, href]
end
story_data.sort! { |a, b| a[0] <=> b[0] }
csv_file = File.join(File.dirname(__FILE__), "..", "draft_stories.csv")
CSV.open(csv_file, "w") do |csv|
csv << ["Title", "Last Edited", "Word Count", "Link"]
story_data.each do |sd|
csv << sd
end
end
end
end