Readability of text using Odyssey

rajana — Wed, 28 Jun 2017 12:07:19 +0000

Flesch-Kincaid readability test

Flesch Kincaid Grade Level

Gunning Fog Score

SMOG

Coleman Liau Index

Automated Readability Index (ARI)

Recently, in a project we worked on, we were asked to measure the readability of various pages of a website. We decided to start with the Flesch-Kincaid test, as our research showed it to be a popular one. The Flesch-Kincaid readability test is designed to indicate how difficult a passage in English is to understand. In this test, a higher score indicates text that is easier to read, and a lower score indicates text that is more difficult to read. The formula for the Flesch-Kincaid reading-ease score is 206.835 - 1.015 * (total words / total sentences) - 84.6 * (total syllables / total words). The scores can be interpreted as follows:

Score	School Level	Notes
100.00-90.00	5th grade	Very easy to read.
90.00-80.00	6th grade	Easy to read.
80.00-70.00	7th grade	Fairly easy to read.
70.00-60.00	8th & 9th grade	Plain English.
60.00-50.00	10th to 12th grade	Fairly difficult to read.
50.00-30.00	College	Difficult to read.
30.00-0.00	College Graduate	Very difficult to read.
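
As a rough illustration of the formula and table above, here is a minimal Ruby sketch that computes the score from raw counts and looks up the school level. The method names, the constant, and the choice to put a boundary score (for example exactly 90.0) into the easier band are our own assumptions, not part of any gem. The counts are taken from the sample Odyssey output shown elsewhere in this article.

```ruby
# Flesch-Kincaid reading-ease score from raw counts, using the formula quoted above.
def flesch_reading_ease(words, sentences, syllables)
  206.835 - 1.015 * (words.to_f / sentences) - 84.6 * (syllables.to_f / words)
end

# School levels from the table above as [lower bound, label] pairs.
# Assumption: a score that sits exactly on a boundary goes to the easier band.
SCHOOL_LEVELS = [
  [90.0, '5th grade'],
  [80.0, '6th grade'],
  [70.0, '7th grade'],
  [60.0, '8th & 9th grade'],
  [50.0, '10th to 12th grade'],
  [30.0, 'College']
].freeze

def school_level(score)
  band = SCHOOL_LEVELS.find { |lower_bound, _label| score >= lower_bound }
  band ? band.last : 'College Graduate'
end

score = flesch_reading_ease(505, 75, 808).round(1)
puts score               # 64.6
puts school_level(score) # 8th & 9th grade
```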

Since we were not experts, we wanted the ability to tweak and experiment with the scoring. We found an existing gem called Odyssey, which implements all of these tests and can also be extended with new formulas. In this article, we will discuss how to use the Odyssey gem to find the readability of an article and of a web page.

Install Odyssey

Add it to your Gemfile.

gem 'odyssey'

Usage

require 'odyssey'
Odyssey.formula_name(text, all_stats)

So if we want to use the Flesch-Kincaid test, we write the code as below.

require 'odyssey'
Odyssey.flesch_kincaid_re(text, all_stats)

To find the readability of a website, we use Nokogiri and Odyssey together: Nokogiri to parse the contents of the page and Odyssey to score its readability. Here is an example that finds the readability of our own website (https://redpanthers.co/):

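require 'nokogiri'
require 'open-uri'
require 'odyssey'

url = "https://redpanthers.co/"
# URI.open comes from open-uri; a bare open(url) no longer works on Ruby 3+
doc = Nokogiri::HTML(URI.open(url))
# Get the text of every paragraph, article, heading, and link
para = doc.css('p', 'article', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'a').map(&:text)
score = Odyssey.flesch_kincaid_re(para.join('. '), true)
p score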

If all_stats is set to false, only the score is returned. If it is true, a hash like the one below is returned:

{
 "string_length"=>3024,
 "letter_count"=>2270,
 "syllable_count"=>808,
 "word_count"=>505,
 "sentence_count"=>75,
 "average_words_per_sentence"=>6.733333333333333,
 "average_syllables_per_word"=>1.6,
 "name"=>"Flesch-Kincaid Reading Ease",
 "formula"=>#,
 "score"=>64.6
}

We can perform multiple text analyses on the same text as shown below

url = "https://redpanthers.co/"
doc = Nokogiri::HTML(URI.open(url))
para = doc.css('p', 'article', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'a').map(&:text)
# Double quotes are needed so that "\n" is a real newline character
score = Odyssey.analyze_multi(para.join('. ').gsub("\n", ' '),
          ['FleschKincaidRe', 'FleschKincaidGl', 'GunningFog', 'Smog', 'Ari', 'ColemanLiau'],
          true)

If all_stats is set to true, it will return a hash like this:

{
 "string_length"=>19892,
 "letter_count"=>14932,
 "syllable_count"=>5079,
 "word_count"=>3325,
 "sentence_count"=>435,
 "average_words_per_sentence"=>7.64367816091954,
 "average_syllables_per_word"=>1.5275187969924813,
 "scores"=>
  {
   "FleschKincaidRe"=>69.8,
   "FleschKincaidGl"=>5.4,
   "GunningFog"=>3.1,
   "Smog"=>8.7,
   "Ari"=>3.5,
   "ColemanLiau"=>10.6
  }
}
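
As a sanity check, two of the scores above can be reproduced by hand from the returned counts. The sketch below uses the commonly published Flesch-Kincaid Grade Level and Automated Readability Index formulas, and it treats letter_count as the character count for ARI. Both are assumptions on our part; Odyssey's own implementations may differ in detail.

```ruby
# Counts copied from the analyze_multi output above.
stats = {
  'letter_count'   => 14_932,
  'syllable_count' => 5_079,
  'word_count'     => 3_325,
  'sentence_count' => 435
}

words_per_sentence = stats['word_count'].to_f / stats['sentence_count']
syllables_per_word = stats['syllable_count'].to_f / stats['word_count']
letters_per_word   = stats['letter_count'].to_f / stats['word_count']

# Flesch-Kincaid Grade Level (commonly published formula).
fk_grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

# Automated Readability Index (commonly published formula),
# assuming letter_count stands in for the character count.
ari = 4.71 * letters_per_word + 0.5 * words_per_sentence - 21.43

puts fk_grade.round(1) # 5.4, the same as "FleschKincaidGl" above
puts ari.round(1)      # 3.5, the same as "Ari" above
```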

Extending Odyssey

To extend Odyssey, you can create a class that inherits from Formula:

class NewFormula < Formula
  def score(passage, stats)
    p passage
    p stats
  end
  def sentence
    "Red Panthers is a Ruby on Rails development studio,
     based in the beautiful city of Cochin."
  end
end

To call your formula you just use

obj = NewFormula.new
Odyssey.new_formula(obj.sentence, false)

The passage argument will be a Hash like this:

{
 "raw"=>"Red Panthers is a Ruby on Rails development studio,
        based in the beautiful city of Cochin.",
 "words"=>["Red", "Panthers", "is", "a", "Ruby", "on", "Rails",
           "development", "studio", "based", "in", "the",
           "beautiful", "city", "of", "Cochin"],
 "sentences"=>["Red Panthers is a Ruby on Rails development studio,
               based in the beautiful city of Cochin."],
 "syllables"=>[1, 2, 1, 1, 2, 1, 1, 4, 2, 1, 1, 1, 4, 2, 1, 2]
}

and the stats argument will be a Hash like this:

{
 "string_length"=>90,
 "letter_count"=>73,
 "word_count"=>16,
 "syllable_count"=>27,
 "sentence_count"=>1,
 "average_words_per_sentence"=>16.0,
 "average_syllables_per_word"=>1.6875
}
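
To show what a score method might do with these two hashes, here is a hedged sketch of a formula that returns the average word length. AverageWordLength is a hypothetical formula of our own, and the first line defines a stand-in Formula class only so the snippet runs without the gem. The name method mirrors the "name" key seen in the result hash, and we assume Odyssey reads it from the formula.

```ruby
# Stand-in so this sketch runs on its own; when the odyssey gem is loaded,
# its Formula class is already defined and this line does nothing.
Formula = Class.new unless defined?(Formula)

# Hypothetical formula: average number of letters per word.
class AverageWordLength < Formula
  # stats is the Hash of counts shown above; passage is not needed here.
  def score(_passage, stats)
    return 0.0 if stats['word_count'].zero?

    (stats['letter_count'].to_f / stats['word_count']).round(1)
  end

  # Assumption: used as the human-readable name in the result hash.
  def name
    'Average word length'
  end
end

# Counts taken from the stats hash above.
stats = { 'letter_count' => 73, 'word_count' => 16 }
puts AverageWordLength.new.score({}, stats) # 4.6
```

Following the NewFormula / new_formula pattern shown earlier, such a class would presumably be called as Odyssey.average_word_length(text, false) once the gem is loaded.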

Because the formula object is included in the result when the all_stats flag is set to true, we also have access to the other methods of the formula class. Thanks to Odyssey, we were able to implement the feature quite easily, and the algorithm we use has since evolved into new forms, but that’s another article. If you want to build a simple readability checker, it is quite easy to do in Rails.


Different types of Index in PostgreSQL

coderhs — Mon, 19 Sep 2016 07:38:45 +0000

PostgreSQL uses different algorithms when indexing tables, and each algorithm suits a certain kind of data. Here we will discuss the various algorithms available and when we should use them. (Note: these are the algorithms found in PostgreSQL 9.5.)

Algorithms

B-Tree (Balanced Tree) is the default algorithm used when we build indexes in Rails. It keeps a sorted copy of our column, which serves as the index. So if we search for rows whose value starts with "a", the search stops as soon as the entries starting with "a" are exhausted, and returns nothing if no match was found, because the index keeps everything sorted. It works well in most cases, which is why it is the default.

Hash is one of the most popular indexing algorithms, but it supports only the equality operator, so the query planner will use a hash index only when the query is an equality comparison. Another point to note is that a Hash index is not WAL (Write-Ahead Log) logged in this version, so if the database crashes the index cannot be recovered from the log and we would need to rebuild it with REINDEX.

GIN (Generalized Inverted Index) is great for indexing columns and expressions that contain an array, JSON, JSONB, etc. Internally, a GIN index contains a B-tree index constructed over keys, where each key is an element of one or more indexed items, and where each tuple in a leaf page contains either a pointer to a B-tree of heap pointers or, when the list is small enough, a simple list of heap pointers.

GiST (Generalized Search Tree) isn’t a single indexing scheme but rather an abstraction that makes it possible to implement indexing schemes for new data types by providing a balanced tree structure access method. In the past, building a custom indexing algorithm for a custom data type required an understanding of the internals of the database. GiST provides an abstraction over those internals, which can be used to build your own indexing algorithm. Because it is built on a balanced tree structure, we can use GiST to index IP addresses, geolocation data, etc.

SP-GiST (Space-Partitioned Generalized Search Tree) is, as the name suggests, similar to GiST, but instead of a balanced tree structure it lets us use non-balanced tree structures such as radix trees, quadtrees, and k-d trees.

BRIN (Block Range Index) is designed to handle very large tables in which the rows’ natural sort order correlates with certain column values. For example, a table storing log entries might have a timestamp column for when each log entry was written. By using a BRIN index on this column, scanning large parts of the table can be avoided when querying rows by their timestamp value, with very little overhead.
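
To make three of these ideas more concrete, here is a toy Ruby sketch. It uses plain arrays and hashes, not PostgreSQL, and the data and variable names are made up for illustration; real index structures are far more involved. It shows why a sorted copy lets a search stop early (B-Tree), why a hash answers only equality lookups (Hash), and how per-block min/max summaries let whole blocks be skipped (BRIN).

```ruby
# Toy models only; these are not how PostgreSQL stores its indexes.
words = %w[banana apple cherry avocado blueberry apricot]

# B-Tree idea: keep a sorted copy, jump to the first candidate, and stop
# as soon as the keys no longer match the search.
sorted_index = words.sort
start = sorted_index.bsearch_index { |word| word >= 'a' } || sorted_index.size
a_words = sorted_index.drop(start).take_while { |word| word.start_with?('a') }

# Hash idea: direct lookup of a row position, but only for exact matches.
hash_index = words.each_with_index.to_h
position = hash_index['cherry']

# BRIN idea: store only the min and max of each block of rows, then skip
# every block whose range cannot contain the value we are looking for.
timestamps = (1..12).to_a # naturally ordered, like log entries
blocks = timestamps.each_slice(4).to_a
summaries = blocks.map { |block| [block.min, block.max] }

target = 6..7
scanned = blocks.zip(summaries).select do |_block, (min, max)|
  min <= target.end && max >= target.begin
end
matches = scanned.flat_map { |block, _summary| block.select { |ts| target.cover?(ts) } }

puts a_words.inspect # ["apple", "apricot", "avocado"]
puts position        # 2
puts scanned.size    # 1 (only one of the three blocks is read)
puts matches.inspect # [6, 7]
```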