Let's say you want to add a search box to your web site to find words within your published content. Or let's say you want to display a list of articles published on your blog, together with snippets of what the article looks like, or a short summary.
In both cases you probably have an .html page or some content from which you want to generate a snippet, just like on this blog: if you visit http://rabexc.org, you can see all articles published recently. Not the whole article, just a short summary of each.
Or if you look to the right of this page, you can see snippets of articles in the same category as this one.
Turns out that generating those snippets with flask and a basic python library is extremely easy.
So, here are a few ways to do it...
Before getting started, don't forget to install all dependencies: this article assumes you have python installed, together with BeautifulSoup and, well, werkzeug. On a Debian system, you need to:
```
sudo apt-get install python-bs4 python-werkzeug
```
or, in a distro-independent way, with pip:
```
pip install beautifulsoup4
pip install werkzeug
```
Without getting too fancy, a really simple way to generate a snippet in python is to extract the content of a given article, and well, display it like in the column to the right of this article.
```python
import bs4

page = FunctionThatGetsYourHTMLPage()

# Get the content of all the elements in the page.
text = bs4.BeautifulSoup(page).getText(separator=" ")

# Limit the content to the first 150 characters, eliminate leading or
# trailing whitespace.
snippet = text[0:150].strip()

# If text was longer than this (most likely), also add '...'
if len(text) > 150:
  snippet += "..."
```
This code is pretty simple: it loads an HTML page with BeautifulSoup, extracts all the text in between html tags, and then limits it to 150 characters. Note that all the formatting will be lost: no bold, no italics, no different fonts, and so on. But this is just what we want: we don't want that formatting in the snippet results.
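For a sense of what is happening under the hood, here is a rough sketch of the same idea using only Python 3's standard library `html.parser` (the names `TextExtractor` and `MakeSnippet` are made up for this example; bs4 is far more robust against broken html, so treat this as an illustration rather than a replacement):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text found in between tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags (including, notably,
        # the contents of <script> and <style> - one of the caveats below).
        self.chunks.append(data)

def MakeSnippet(page, limit=150):
    """Extracts the text of an html page, normalized and truncated."""
    parser = TextExtractor()
    parser.feed(page)
    # Collapse all runs of whitespace into single spaces.
    text = " ".join(" ".join(parser.chunks).split())
    snippet = text[0:limit]
    if len(text) > limit:
        snippet += "..."
    return snippet

print(MakeSnippet("<html><body><h1>Title</h1><p>Some   text.</p></body></html>"))
# → "Title Some text."
```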
This is still not the best way to do it, as:
1) It is slow: parsing the html page server side in python is not exactly the fastest thing you can do. With some measurements, it can easily add up and become one of the slowest operations on the site. For example: generating the bar to the right seemed to take several hundred ms, while just returning the article seemed to take at most tens of ms.
2) You will see anything in between `<script>` tags, headers, and so on.
3) If there is lots of whitespace, you will see that as well in your page.
So, what can we do? Well, here's a few simple things you can do to only get the parts you are interested in:
1) Skip the whitespace:
```python
snippet = " ".join(text.split()).strip()[0:150]
```
2) Start from the `<body>` tag:

```python
text = bs4.BeautifulSoup(page).find("body").getText(separator=" ")
```
3) ... and well, cache, cache and cache the results. E.g., have your code be lazy: don't compute the snippet every time the same snippet has to be displayed. If you use flask and/or werkzeug, you can have something like:
```python
import werkzeug
import bs4

class Article(object):
  @werkzeug.cached_property
  def html(self):
    # Reads the html from file, or generates it, or ...
    pass

  @werkzeug.cached_property
  def snippet(self, limit=150):
    text = bs4.BeautifulSoup(self.html).getText(separator=" ")
    snippet = " ".join(text.split()).strip()[0:limit]
    if len(text) > limit:
      snippet += "..."
    return snippet
```
Don't forget that to get any benefit from the caching above, you also need to cache the Article objects themselves (by generating them once and keeping them in a list, ...).
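One possible way to do that (a minimal sketch; `GetArticle` and `_ARTICLE_CACHE` are hypothetical names, not part of flask or werkzeug) is a module level dict keyed by path, so each Article is created once and reused afterwards:

```python
# Module level cache, mapping path -> Article instance.
_ARTICLE_CACHE = {}

class Article(object):
    """Stand-in for the Article class with the cached properties above."""
    def __init__(self, path):
        self.path = path

def GetArticle(path):
    """Returns a cached Article, creating it on first access only."""
    article = _ARTICLE_CACHE.get(path)
    if article is None:
        article = _ARTICLE_CACHE[path] = Article(path)
    return article

first = GetArticle("posts/example.md")
second = GetArticle("posts/example.md")
assert first is second  # the same object, so its cached snippet is reused
```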
After the changes above, generating the snippet will be literally 6 lines:
```python
def snippet(self, limit=150):
  text = bs4.BeautifulSoup(self.html).getText(separator=" ")
  snippet = " ".join(text.split()).strip()[0:limit]
  if len(text) > limit:
    snippet += "..."
  return snippet
```
The method above with BeautifulSoup works well to extract a snippet of unformatted text from an existing .html file. As you noticed above, though, all you get is a string containing the text found in between tags.
For this site, I found an extremely simple and elegant solution: all articles are written in MarkDown. Using the python markdown library, I take the wiki-style text and turn it into html.
MarkDown is simple: parsing it is much easier than parsing .html. To generate a formatted summary, I just take the unformatted wiki-like text in MarkDown format and break it after a few paragraphs, using something like:
```python
def Summarize(markdown, limit=1000):
  """Returns a string with the beginning of a markdown article.

  Args:
    markdown: string, containing the original article in markdown format.
    limit: integer, how many characters of the original article to produce
      in the summary before starting to look for a good place to stop it.

  Returns:
    string, a markdown summary of at least limit length, unless the
    article is shorter.
  """
  import itertools

  summary = []
  count = 0

  # Skip titles, we don't want titles in summaries.
  def ShouldLineBeSkipped(line):
    return line.startswith("#")

  # Create an iterator to go over all lines, with titles removed.
  lines = itertools.ifilterfalse(ShouldLineBeSkipped, markdown.splitlines())

  # Save all lines until we reach our limit.
  for line in lines:
    summary.append(line)
    count += len(line)
    if count >= limit:
      break

  # Save lines until we find a good place to break the article.
  for line in lines:
    # Keep going until what could be the end of a paragraph.
    if not line.strip() and summary[-1] and summary[-1][-1] == ".":
      break
    summary.append(line)

  # Add an empty line, and a bolded '...' at the end of the summary.
  summary.append("")
  summary.append("**[ ... ]**")

  # Finally, return the summary.
  return "\n".join(summary)
```
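The stop condition in the second loop is the subtle part: keep appending lines past the limit until a blank line follows a kept line ending in a period, which is a decent guess for the end of a paragraph. A small Python 3 sketch exercising just that heuristic (`BreakAfter` is a made-up name for this example):

```python
def BreakAfter(lines, limit):
    """Keeps lines until limit chars, then stops at a paragraph end."""
    summary, count = [], 0
    lines = iter(lines)
    # Phase 1: accumulate lines until we have at least `limit` characters.
    for line in lines:
        summary.append(line)
        count += len(line)
        if count >= limit:
            break
    # Phase 2: keep going until a blank line follows a sentence-ending line.
    for line in lines:
        if not line.strip() and summary[-1] and summary[-1][-1] == ".":
            break
        summary.append(line)
    return summary

lines = ["First paragraph.", "", "Second one, a bit longer.", "", "Third."]
print(BreakAfter(lines, limit=10))
# → ['First paragraph.']
```

With `limit=20`, the first phase runs into the second paragraph, so the result extends through `"Second one, a bit longer."` before stopping at the blank line after it.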
Now, to have a nicely formatted summary of an article, all I have to do is something like:
```python
import markdown

text = LoadMarkdownTextFromDisk()
html = markdown.markdown(Summarize(text), ["codehilite", "headerid", "def_list"])
```
The same suggestion about caching applies here: we could extend our Article class to have a summary cached property producing the formatted summary.
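A minimal sketch of what that could look like (hypothetical names; `functools.cached_property` from Python 3.8+ behaves like werkzeug's, and the markdown rendering step is stubbed out in a comment so the sketch stays self-contained):

```python
import functools

class Article(object):
    def __init__(self, path):
        self.path = path

    @functools.cached_property
    def text(self):
        # In the real code, this would read the markdown file from disk.
        return "# A title\n\nfirst paragraph.\n\nsecond paragraph.\n"

    @functools.cached_property
    def summary(self):
        # In the real code: markdown.markdown(Summarize(self.text), [...]).
        # Here we just drop titles and normalize whitespace, to show that
        # the property is computed once and cached afterwards.
        lines = [line for line in self.text.splitlines()
                 if not line.startswith("#")]
        return " ".join(" ".join(lines).split())

article = Article("example.md")
print(article.summary)
# → "first paragraph. second paragraph."
```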
Neither of those methods is perfect: they rely on friendly html pages and markdown for formatting. However, they are extremely simple, a fully general solution would likely be more complex, and, well, they seem to work well enough for me :).