Case Study: Use Selenium To Extract A List of Published Medium Article Titles
A helpful utility for high-volume Medium bloggers
Medium authors know that only 20 to 25 stories (under ‘Drafts’ and ‘Published’) are displayed on the stories page at first. When you scroll down to the bottom, another 20 or so stories will load. This is known as lazy loading (or progressive loading).
For high-volume Medium bloggers, who have hundreds of stories under ‘Published’ and ‘Drafts’, finding a specific article is quite time-consuming. My father asked me to write a simple Selenium utility to help.
I already wrote an article, “Automated Testing Elements on a Lazy Load Page with Selenium WebDriver”, to show how to automate lazy loading pages in Selenium WebDriver. Let’s put it into practice.
Test Design
- Log in to the Medium account, passing authentication manually.
It is not a good idea to store usernames and passwords in your scripts. For this utility, I can use TestWise’s attaching-session feature, so I log in by hand instead of scripting the authentication.
- Keep scrolling down until reaching the end.
For this, scroll a “big enough” number of times. It isn’t necessary to guarantee the end is reached.
This will take a while, so treat yourself to a cup of coffee while it is running.
- Extract story titles.
Easy with Selenium WebDriver.
- Save all story titles into a text file.
Easy with Ruby.
Steps
1. Preparation
Create a Selenium-RSpec project in TestWise.
2. Run the empty test case
The purpose of this step is to get a Chrome browser session that TestWise can attach to.
Please note: right-click a line within the test case and select the first Run “…” option. This is called “individual test execution mode” in TestWise.
You should see a Chrome browser session start and open the Medium home page.
3. Run the sign-in step against the open browser
Type driver.find_element(:link_text, "Sign in").click into the test case, select that line, then right-click and choose the Run Selected Script Against Current Browser option:
This enters TestWise’s “debugging mode”.
You will see the same browser window from before open the sign in pop up.
4. Log in manually
It is bad practice to put authentication details (in plain text) in E2E test scripts. And many external websites use Captchas to prevent automated logins anyway.
The workaround is TestWise’s ability to attach to an existing browser session.
Below is the page after logging in (I used my father’s account, which has many more articles and so suits this case study better).
5. Get one story title in Selenium Script
Manually navigate to your published articles, https://medium.com/me/stories/public, then right-click any article title and inspect it in Chrome.
XPath makes a good locator for the story title. Since there is more than one published story, use find_elements to get the list of all stories:
# get the third story's title (Ruby arrays are zero-indexed, so [2] is the third element)
driver.find_elements(:xpath, "//a[@data-testid='postTitle']")[2].text
Then type the two lines above into the special debugging_spec.rb and run it.
6. Handle the lazy loading: scroll to the end
Let’s use the approach from my previous article: in a loop, use keyboard controls to jump down the page. Each jump should trigger the lazy load to fetch the next batch of articles.
10.times do |x| # 10 is arbitrary; in the real program we want this to be large (e.g. 100)
  puts "Scroll: #{x + 1}"
  driver.find_element(:tag_name, "body").send_keys(:end)
  sleep 1 # allow time for the next articles to load
end
Keen readers might notice that I’m using the “End” key, not the “Page Down” key to scroll like I did in the previous article. The End key is a lot more efficient for jumping down a page.
For my 134 published articles, the script takes 24 “page downs” to load them all, and only 11 “ends”!
7. Extract all story titles
Use Selenium’s find_elements to get all matching story elements, then get the title of each element with Ruby’s collect.
story_elems = driver.find_elements(:xpath, "//a[@data-testid='postTitle']")
story_titles = story_elems.collect{|x| x.text}
puts "Total #{story_titles.count} stories"
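The extraction step can be tried out without a browser. The sketch below rests on a few assumptions: `FakeElement` is a hypothetical stand-in for a Selenium element (anything that responds to `text`), and `extract_titles` is a helper name made up for this illustration, not part of Selenium. It also trims whitespace and drops blank titles, which the script above does not do.

```ruby
# Hypothetical stand-in for a Selenium element; only #text is needed here.
FakeElement = Struct.new(:text)

# Collect the visible text of each element, trimming whitespace and
# skipping elements whose text is empty.
def extract_titles(elements)
  elements.map { |elem| elem.text.to_s.strip }.reject(&:empty?)
end

elems = [
  FakeElement.new("First story"),
  FakeElement.new("  Second story "),
  FakeElement.new("")
]
story_titles = extract_titles(elems)
puts "Total #{story_titles.count} stories"
```

With a real session, the list returned by `driver.find_elements(...)` could be passed in place of `elems`.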
8. Save to a file
This is a standard Ruby write to file scenario:
fio = File.open(File.dirname(__FILE__) + "/../story_titles.txt", "w")
story_titles.each do |title|
  fio.puts(title)
end
fio.flush
fio.close
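An alternative worth considering is the block form of `File.open`, which closes the file automatically even if an error occurs while writing. This is a minimal sketch, not the script used in this case study: `save_titles` is a hypothetical helper name, and it writes to a temporary directory so it can be run on its own.

```ruby
require "tmpdir"

# Write one title per line. The block form of File.open closes the file
# when the block exits, so no explicit flush/close is needed.
def save_titles(titles, path)
  File.open(path, "w") do |file|
    titles.each { |title| file.puts(title) }
  end
end

# Usage with a temporary directory, so the example leaves nothing behind.
Dir.mktmpdir do |dir|
  path = File.join(dir, "story_titles.txt")
  save_titles(["First story", "Second story"], path)
  puts File.read(path)
end
```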
The result:
9. Minor Optimization — Stop scrolling after all articles have loaded
When you run the script, the time spent scrolling down the page is very noticeable. And since we picked an arbitrarily large number of times to scroll (e.g. 100), we might keep scrolling even after all the articles have loaded.
We can do some minor optimisation to stop scrolling after there are no more articles left to load (i.e. no more new articles appeared after a scroll).
Before:
100.times do |x|
  puts("Scroll: #{x + 1}")
  driver.find_element(:tag_name, "body").send_keys(:end)
  sleep 1
end
After:
story_elems = driver.find_elements(:xpath, "//a[@data-testid='postTitle']")
100.times do |x|
  puts("Scroll: #{x + 1}")
  driver.find_element(:tag_name, "body").send_keys(:end)
  sleep 1
  # check if any new articles have loaded
  new_story_elems = driver.find_elements(:xpath, "//a[@data-testid='postTitle']")
  if (story_elems.count == new_story_elems.count) # no more articles left to load
    break # stop looping
  else
    story_elems = new_story_elems
  end
end
This dropped my total scroll count from 100 to only 11!
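One possible refinement, offered as a sketch rather than what I actually ran: a fixed `sleep 1` can be too short on a slow connection, which would make the loop stop early. The hypothetical helper `scroll_until_stable` below polls for new elements for up to `timeout` seconds after each scroll. The `FakePage` class is a made-up stand-in for the browser so the example runs without Selenium; the parameter names and defaults are assumptions.

```ruby
# Scroll with the End key until the number of matching elements stops growing.
# After each scroll, poll for new elements for up to `timeout` seconds
# instead of sleeping for a fixed time. Returns the number of scrolls performed.
def scroll_until_stable(driver, xpath, max_scrolls: 100, timeout: 3, interval: 0.2)
  count = driver.find_elements(:xpath, xpath).count
  scrolls = 0
  max_scrolls.times do
    driver.find_element(:tag_name, "body").send_keys(:end)
    scrolls += 1
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    new_count = count
    loop do
      new_count = driver.find_elements(:xpath, xpath).count
      break if new_count > count
      break if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      sleep interval
    end
    break if new_count == count # nothing new appeared within the timeout
    count = new_count
  end
  scrolls
end

# Hypothetical stand-in for a browser: each End key press "loads" one more batch.
class FakePage
  def initialize(total, batch)
    @total = total
    @batch = batch
    @loaded = [batch, total].min
  end

  def find_elements(_how, _what)
    Array.new(@loaded)
  end

  def find_element(_how, _what)
    self
  end

  def send_keys(_key)
    @loaded = [@loaded + @batch, @total].min
  end
end

page = FakePage.new(50, 20)
puts scroll_until_stable(page, "//a[@data-testid='postTitle']", timeout: 0.1, interval: 0.02)
```

The trade-off is that the final scroll, the one that finds nothing new, always waits for the full timeout before the loop gives up.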
Full Test Script
load File.dirname(__FILE__) + "/../test_helper.rb"

describe "Test Suite" do
  include TestHelper

  before(:all) do
    # browser_type, browser_options, site_url are defined in test_helper.rb
    @driver = $driver = Selenium::WebDriver.for(browser_type, browser_options)
    driver.manage.window.resize_to(1280, 800)
    driver.get(site_url)
  end

  after(:all) do
    driver.quit unless debugging?
  end

  it "Login and Extract Published Story Titles" do
    driver.find_element(:link_text, "Sign in").click
    # MANUAL LOGIN, then run the below in TestWise Debugging mode
    visit("/me/stories/public")

    # stories visible before scrolling; the baseline for the loop below
    story_elems = driver.find_elements(:xpath, "//a[@data-testid='postTitle']")
    100.times do |x|
      puts("Scroll: #{x + 1}")
      driver.find_element(:tag_name, "body").send_keys(:end)
      sleep 1
      # check if any new articles have loaded
      new_story_elems = driver.find_elements(:xpath, "//a[@data-testid='postTitle']")
      if (story_elems.count == new_story_elems.count) # no more articles left to load
        break
      else
        story_elems = new_story_elems
      end
    end

    story_titles = story_elems.collect { |x| x.text }
    puts "Total #{story_titles.count} stories"

    # write to file
    fio = File.open(File.dirname(__FILE__) + "/../story_titles.txt", "w")
    story_titles.each do |title|
      fio.puts(title)
    end
    fio.flush
    fio.close
  end
end