A simple way to generate snippets in python

The Problem

Let's say you want to add a search box to your web site to find words within your published content. Or let's say you want to display a list of articles published on your blog, together with snippets of what the article looks like, or a short summary.

In both cases you probably have an .html page or some content from which you want to generate a snippet, just like on this blog: if you visit http://rabexc.org, you can see all articles published recently. Not the whole article, just a short summary of each.

Or if you look to the right of this page, you can see snippets of articles in the same category as this one.

It turns out that generating those snippets in python, using flask and a few basic python libraries, is extremely easy.

So, here are a few ways to do it...

Using javascript

Before starting to talk about python, I should mention that doing this in javascript is extremely easy and straightforward. In fact, you can find many sites that load entire articles and then magically hide portions of them using javascript.

However, this has a few major drawbacks:

1) Unless you compute the snippets server side, the browser of your user will still receive the whole article so that javascript can chop it up and display only a piece of it.

2) The content will not necessarily be search engine friendly. If you embed the whole content on the page, a search engine may return the index page rather than the article when one of your users looks up an obscure word. Worse: this word and the interesting context may end up being hidden by your javascript, leading to an overall bad experience for your users. And if you decide to go the RESTful way, with javascript fetching the page via some API, the search engine is unlikely to see the content at all.

Some sites use javascript anyway. Here, though, I will only talk about how to do this server side, in python.

Before getting started, don't forget to install all the dependencies: this article relies on python, BeautifulSoup and, well, there will be references to werkzeug. On a Debian system, you need to run:

sudo apt-get install python3-bs4 python3-werkzeug

or, in a distro independent way with python installed:

pip install beautifulsoup4
pip install werkzeug

Using BeautifulSoup

Without getting too fancy, a really simple way to generate a snippet in python is to extract the content of a given article, and well, display it like in the column to the right of this article.

For example:

import bs4
page = FunctionThatGetsYourHTMLPage()

# Get the content of all the elements in the page, without leading or
# trailing whitespace. Naming the parser avoids a warning from bs4.
text = bs4.BeautifulSoup(page, "html.parser").get_text(separator=" ").strip()

# Limit the content to the first 150 characters.
snippet = text[0:150]

# If text was longer than this (most likely), also add '...'
if len(text) > 150:
  snippet += "..."

This code is pretty simple: it loads an HTML page with BeautifulSoup, extracts all the text in between html tags, and then limits it to 150 characters. Note that all the formatting will be lost: no bolds, italics, different fonts, and so on. But this is just what we want: we don't want this formatting in the snippet results.

This is still not the best way to do it, as:

1) It is slow: parsing the html page in python server side is not exactly the fastest thing you can do. In my measurements, it easily added up and became one of the slowest operations on the site. For example: generating the bar to the right seemed to take several hundred ms, while just returning the article seemed to take at most tens of ms.

2) You will see all the text in the page, not just the article: the page title, headers, and so on.

3) If there is lots of whitespace, you will see that as well in your page.
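Points 2 and 3 are easy to reproduce on a tiny page. The sketch below uses a made up page in place of FunctionThatGetsYourHTMLPage(), and names the "html.parser" parser explicitly; both are assumptions for the sake of the example.

```python
import bs4

# Hypothetical page, standing in for FunctionThatGetsYourHTMLPage().
page = """<html><head><title>My blog</title></head><body>
  <h1>Snippets</h1>

  <p>Hello   <b>world</b></p>
</body></html>"""

# "html.parser" ships with python, so the result does not depend on which
# optional parsers happen to be installed.
text = bs4.BeautifulSoup(page, "html.parser").get_text(separator=" ")

# The raw text contains the page title and all the newlines and indentation.
print(repr(text))

# Collapsing the whitespace shows what would end up in the snippet.
print(" ".join(text.split()))  # My blog Snippets Hello world
```

The title ("My blog") ends up in front of the article text, and the raw string is full of newlines and runs of spaces, which is what the fixes below address.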

So, what can we do? Well, here are a few simple things you can do to only get the parts you are interested in:

1) Skip the whitespace:

snippet = " ".join(text.split()).strip()[0:150]

2) Start from the <body>:

text = bs4.BeautifulSoup(page, "html.parser").find("body").get_text(separator=" ")

3) ... and well, cache, cache and cache the results. In other words, don't recompute the snippet every time it has to be displayed. If you use flask and/or werkzeug, you can have something like:

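import bs4
from werkzeug.utils import cached_property

class Article(object):
  # Snippet length, in characters. A cached_property is read like an
  # attribute, so it cannot take arguments.
  SNIPPET_LIMIT = 150

  @cached_property
  def html(self):
    # Reads the html from file, or generates it, or ...
    raise NotImplementedError

  @cached_property
  def snippet(self):
    text = bs4.BeautifulSoup(self.html, "html.parser").get_text(separator=" ")
    text = " ".join(text.split())
    snippet = text[0:self.SNIPPET_LIMIT]

    # Compare against the cleaned up text, not the raw one.
    if len(text) > self.SNIPPET_LIMIT:
      snippet += "..."

    return snippet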

Don't forget that to get any benefit from the caching above, you also need to cache Article objects (by generating them once and keeping them in a list, ...).
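One possible way to keep Article objects around is to hand them out through a memoized function. This is only a sketch: the get_article name and the path argument are made up for illustration, functools.cached_property (python 3.8 or later) stands in for the werkzeug decorator, and the snippet body is a placeholder for the real parsing.

```python
import functools


class Article:
    """Hypothetical stand-in for the Article class above."""

    def __init__(self, path):
        self.path = path

    @functools.cached_property
    def snippet(self):
        # Placeholder for the expensive parse-and-truncate step.
        return "snippet of " + self.path


@functools.lru_cache(maxsize=None)
def get_article(path):
    """Returns the one shared Article for path, creating it on first use."""
    return Article(path)


first = get_article("posts/snippets.html")
second = get_article("posts/snippets.html")

print(first is second)  # True: both names point to the same object
print(first.snippet)    # computed on this first access
print(second.snippet)   # reused, as it is the same Article
```

Nothing here notices when an article changes on disk: in that case the cache has to be dropped explicitly, for example with get_article.cache_clear().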

After the changes above, generating the snippet will be literally 6 lines:

def snippet(self, limit=150):
  soup = bs4.BeautifulSoup(self.html, "html.parser")
  text = " ".join(soup.get_text(separator=" ").split())
  if len(text) > limit:
    return text[0:limit] + "..."
  return text
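To see what the whitespace handling and the limit do in isolation, the text part can be pulled out into a small helper that does not need any html. The make_snippet name is made up for this sketch.

```python
def make_snippet(text, limit=150):
    """Collapses whitespace in text and cuts it to at most limit characters."""
    # split() with no arguments splits on any run of whitespace and ignores
    # leading and trailing whitespace, so no extra strip() is needed.
    clean = " ".join(text.split())
    if len(clean) <= limit:
        return clean
    return clean[:limit] + "..."


print(make_snippet("  Hello,\n\n   world!  "))      # Hello, world!
print(make_snippet("one two three four", limit=7))  # one two...
```

Measuring the length after collapsing the whitespace avoids adding '...' to a text that only looked long because of its newlines and indentation.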

Maintaining formatted text

The method above with BeautifulSoup works well to extract a snippet of unformatted text from an existing .html file. If you noticed above, all you get is a string containing the text in between tags.

I could not find any really easy way to extract text while maintaining some formatting, except for whitelisting some html tags and blacklisting others. But doing so in a very generic way gets tricky: what about stylesheets? What about some javascript formatting and magic? What if I have a large table? Images? ...
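For reference, the whitelisting idea could look roughly like the sketch below. It is not what this site uses: it relies on the html.parser module from the standard library rather than BeautifulSoup, and the lists of allowed and skipped tags are arbitrary choices made for the example.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED = {"b", "i", "em", "strong", "code"}  # tags kept in the output
SKIPPED = {"script", "style"}                 # tags whose content is dropped


class WhitelistFilter(HTMLParser):
    """Keeps text and whitelisted tags, drops everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in ALLOWED and not self.skip_depth:
            # Attributes (class, style, onclick, ...) are dropped on purpose.
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED and not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))


def keep_whitelisted(html):
    parser = WhitelistFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(keep_whitelisted(
    '<p class="x">Some <b>bold</b> text<script>alert(1)</script></p>'))
# Some <b>bold</b> text
```

Note that this only filters: cutting the result at a given length would also require closing whatever tags are still open at that point, which is part of why the generic approach gets tricky.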

For this site, I found an extremely simple and elegant solution: all articles are written in Markdown. Using the markdown library, I take wiki style text and turn it into html.

Markdown is simple: parsing it is much easier than parsing .html. To generate formatted text, I just take the unformatted, wiki like text in Markdown format and cut it after a few paragraphs, using something like:

import itertools

def Summarize(markdown, limit=1000):
  """Returns a string with the beginning of a markdown article.

  Args:
    markdown: string, containing the original article in markdown format.
    limit: integer, how many characters of the original article to produce
      in the summary before starting to look for a good place to stop it.

  Returns:
    string, a markdown summary of at least limit length, unless the article
    is shorter.
  """
  summary = []
  count = 0

  # Skip titles, we don't want titles in summaries.
  def ShouldLineBeSkipped(line):
    return line.startswith("#")

  # Create an iterator over all the lines that should not be skipped.
  # filterfalse drops the lines for which the function returns True
  # (on python 2, this is itertools.ifilterfalse).
  lines = itertools.filterfalse(ShouldLineBeSkipped, markdown.splitlines())

  # Save all lines until we reach our limit.
  for line in lines:
    summary.append(line)

    count += len(line)
    if count >= limit:
      break

  # Save lines until we find a good place to break the article. This
  # continues from where the loop above stopped, as lines is an iterator.
  for line in lines:
    # Keep going until what could be the end of a paragraph.
    if not line.strip() and summary[-1].endswith("."):
      break
    summary.append(line)

  # Add an empty line, and bolded '...' at the end of the summary.
  summary.append("")
  summary.append("**[ ... ]**")

  # Finally, return the summary.
  return "\n".join(summary)

Now, to have a nicely formatted summary of an article, all I have to do is something like:

import markdown
text = LoadMarkdownTextFromDisk()
# "headerid" is left out of the list: it was removed in Markdown 3.0.
html = markdown.markdown(Summarize(text), extensions=["codehilite", "def_list"])

The same suggestion about caching applies here: we could extend our Article class to have a summary cached property producing the formatted summary.
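A minimal sketch of that extension could look like this. To keep it runnable without the markdown library, the summarizing and rendering functions are passed in and replaced by trivial stand-ins; in real use they would be something like the Summarize function above and markdown.markdown. functools.cached_property (python 3.8 or later) is used here in place of the werkzeug decorator, and all the names are assumptions.

```python
import functools


class Article:
    def __init__(self, markdown_text, summarize, render):
        self.markdown_text = markdown_text
        self._summarize = summarize  # e.g. the Summarize function above
        self._render = render        # e.g. markdown.markdown

    @functools.cached_property
    def summary(self):
        """Formatted summary, computed on first access and then reused."""
        return self._render(self._summarize(self.markdown_text))


# Stand-ins so that the sketch runs on its own.
calls = []

def fake_render(text):
    calls.append(text)
    return "<p>" + text + "</p>"

article = Article("First paragraph.\n\nSecond paragraph.",
                  summarize=lambda text: text.split("\n\n")[0],
                  render=fake_render)

print(article.summary)  # <p>First paragraph.</p>
print(article.summary)  # same value, no second call to fake_render
print(len(calls))       # 1
```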

Conclusions

Neither of these methods is perfect: they rely on friendly html pages and on markdown for formatting. However, they are extremely simple, a fully general solution would likely be more complex, and well, they seem to work well enough for me :).
