How to create a Text Analyzer in Ruby?

These are my notes on Chapter 4 of the book Beginning Ruby: From Novice to Professional, a book highly recommended for dummies. I found it via A Path to Learn Rails 4 properly.

Description

This code will read in text supplied in a separate file, analyze it for various patterns and statistics, and print out the results for the user.

Required Basic Features

  • Character count
  • Character count (excluding spaces)
  • Line count
  • Word count
  • Sentence count
  • Paragraph count
  • Average number of words per sentence
  • Average number of sentences per paragraph

Building the Basic Application

Let’s outline the basic steps as follows:

  1. Obtain some dummy text.
  2. Load in a file containing the text or document you want to analyze.
  3. As you load the file line by line, keep a count of how many lines there were.
  4. Put the text into a string and measure its length to get your character count.
  5. Temporarily remove all whitespace and measure the length of the resulting string to get the character count excluding spaces.
  6. Split on whitespace to find out how many words there are.
  7. Split on full stops to find out how many sentences there are.
  8. Split on double newlines to find out how many paragraphs there are.
  9. Perform calculations to work out the averages.

Create a new, blank Ruby source file and save it as analyzer.rb in your Ruby folder. As you work through the next few sections, you’ll be able to fill it out.

1. Obtain some dummy text

Save the dummy text in a file called text.txt, in the same folder where you will save analyzer.rb.

2. Load in a file containing the text

ARGV is the array of command-line arguments passed to the script. The name of the file to analyze is taken from ARGV[0] or ARGV.first (both mean exactly the same thing: the first element of the ARGV array).

lines = File.readlines(ARGV[0])

To process text.txt now, run the script like this:

ruby analyzer.rb text.txt
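If the script is run without an argument, or with a file name that does not exist, File.readlines fails with a fairly cryptic error. A minimal sketch of a guard follows; the helper name read_lines is hypothetical (it is not from the book), and a temporary file stands in for text.txt so the snippet runs on its own.

```ruby
require "tempfile"

# Hypothetical helper (not from the book): read every line of the file at
# `path`, raising a readable error if the path is missing or wrong.
def read_lines(path)
  raise ArgumentError, "usage: ruby analyzer.rb FILE" if path.nil?
  raise ArgumentError, "no such file: #{path}" unless File.file?(path)

  File.readlines(path)
end

# Demonstration: a temporary file stands in for text.txt.
Tempfile.create("text") do |file|
  file.write("First line.\nSecond line.\n")
  file.flush
  lines = read_lines(file.path)
  puts "#{lines.size} lines read"
end
```

In analyzer.rb itself the call would presumably be read_lines(ARGV[0]).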

3. Count how many lines are within the file.

line_count = lines.size
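Note that this count includes blank lines, because every line of the file (including its trailing newline) becomes one element of the array. The sketch below illustrates the difference; String#lines is used as a stand-in for File.readlines so no file is needed, and the helper name line_counts is hypothetical.

```ruby
# Hypothetical helper: count all lines, and separately the lines that
# contain something other than whitespace.
def line_counts(lines)
  non_blank = lines.reject { |line| line.strip.empty? }
  { total: lines.size, non_blank: non_blank.size }
end

sample = "First line.\n\nThird line.\n"
counts = line_counts(sample.lines)
puts "#{counts[:total]} lines, #{counts[:non_blank]} of them non-blank"
```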

4. Put the text into a string and measure its length.

The join method can be used to join the Array of lines back into a single String.

text = lines.join
character_count = text.length
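One detail worth knowing: length counts characters (newlines included), not bytes, so text with accented letters gives different results for length and bytesize. A small sketch, with a hypothetical helper name and made-up sample lines:

```ruby
# Hypothetical helper: join the lines and report characters and bytes.
def character_stats(lines)
  text = lines.join
  { characters: text.length, bytes: text.bytesize }
end

# "\u00e9" is the letter e with an acute accent: one character, two bytes
# in UTF-8.
stats = character_stats(["Ol\u00e9!\n", "Second line\n"])
puts "#{stats[:characters]} characters, #{stats[:bytes]} bytes"
```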

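5. Temporarily remove all whitespace and measure the length.

The gsub method, called as string.gsub(pattern, replacement), returns a copy of the string in which every match of the regular expression pattern is replaced with replacement. Here every run of whitespace is replaced with an empty string, and the original text is left untouched.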

character_count_nospaces = text.gsub(/\s+/, '').length
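A short sketch of what this does on a small sample string. The helper name count_without_whitespace and the sample text are made up for illustration.

```ruby
# Hypothetical helper: number of characters that are not whitespace.
def count_without_whitespace(text)
  text.gsub(/\s+/, '').length
end

text = "Hello  big\tworld\n"
puts text.length                     # counts spaces, the tab and the newline
puts count_without_whitespace(text)  # counts only the letters
puts text.inspect                    # gsub returned a copy; text is unchanged
```

Because \s matches spaces, tabs and newlines alike, this is really a count excluding all whitespace, not only spaces.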

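6. Find out how many words there are.

The split method breaks a string into an Array of pieces. Called with no arguments, it splits on runs of whitespace; it can also split on a single character, a static sequence of characters, or a regular expression.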

word_count = text.split.length
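The sketch below shows why split with no arguments suits word counting: leading whitespace and repeated spaces do not produce empty "words", whereas splitting on a single space does. The helper name count_words and the sample text are illustrative only.

```ruby
# Hypothetical helper: count whitespace-separated words.
def count_words(text)
  text.split.length
end

text = "  The quick  brown\nfox jumps.  "
p text.split          # whitespace runs are treated as one separator
p text.split(/ /)     # splitting on each single space leaves empty strings
puts count_words(text)
```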

7. Split on full stops to find out how many sentences there are.

The regular expression below splits on question marks and exclamation marks as well as full stops.

sentence_count = text.split(/\.|\?|!/).length
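One thing to watch for: if the text ends with a newline after the last full stop, the newline is left over as an extra piece, and the count comes out one too high. An ellipsis ("...") also produces empty pieces. The sketch below compares the count from the notes with a variation that strips each piece and drops the empty ones; the variation and both helper names are my own assumption, not the book's code.

```ruby
SENTENCE_END = /\.|\?|!/

# The approach from the notes: count every piece produced by the split.
def naive_sentence_count(text)
  text.split(SENTENCE_END).length
end

# Assumed variation: ignore pieces that contain only whitespace.
def trimmed_sentence_count(text)
  text.split(SENTENCE_END).map(&:strip).reject(&:empty?).length
end

text = "Hello there. How are you? Fine!\n"
puts naive_sentence_count(text)    # 4, the trailing "\n" counts as a piece
puts trimmed_sentence_count(text)  # 3
```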

8. Split on double newlines to find out how many paragraphs there are.

paragraph_count = text.split(/\n\n/).length
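This works when paragraphs are separated by exactly one blank line. With more blank lines between paragraphs, the split leaves empty strings in the middle of the array and the count is too high. A hedged variation, using a hypothetical helper count_paragraphs that treats any run of two or more newlines as a single break:

```ruby
# Assumed variation (not from the book): any run of two or more newlines
# is one paragraph break; leading and trailing blank lines are ignored.
def count_paragraphs(text)
  text.strip.split(/\n{2,}/).length
end

text = "First paragraph.\n\n\n\nSecond paragraph.\n"
puts text.split(/\n\n/).length   # 3, the empty middle piece is counted
puts count_paragraphs(text)      # 2
```

Files with Windows line endings (\r\n) are not handled by either version.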

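9. Perform calculations to work out the averages.

The two averages are plain divisions of the counts above, sentence_count / paragraph_count and word_count / sentence_count, and the full program computes them inline when it prints the results. Because all the counts are integers, Ruby truncates the results to whole numbers.

The full program also works out the percentage of words that are not stop words (common words such as “the” or “and”):

stopwords = %w{the a by on for of are with just but and to the my I has some in}
all_words = text.scan(/\w+/)
good_words = all_words.select{ |word| !stopwords.include?(word) }
good_percentage = ((good_words.length.to_f / all_words.length.to_f) * 100).to_i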
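For the two averages in the feature list, dividing one integer count by another truncates the result, and an empty file would make the divisor zero. A minimal sketch of a float-based alternative follows; the helper name average and the sample numbers are hypothetical.

```ruby
# Hypothetical helper: average of `total` items over `groups` groups,
# rounded to one decimal place, with 0.0 when there are no groups.
def average(total, groups)
  return 0.0 if groups.zero?

  total.fdiv(groups).round(1)
end

sentence_count = 7
paragraph_count = 2
puts sentence_count / paragraph_count          # 3, integer division truncates
puts average(sentence_count, paragraph_count)  # 3.5
puts average(5, 0)                             # 0.0 instead of an error
```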

Code

# analyzer.rb -- Text Analyzer

stopwords = %w{the a by on for of are with just but and to the my I has some in}
lines = File.readlines(ARGV[0])
line_count = lines.size
text = lines.join

# Count the characters
character_count = text.length
character_count_nospaces = text.gsub(/\s+/, '').length

# Count the words, sentences, and paragraphs
word_count = text.split.length
sentence_count = text.split(/\.|\?|!/).length
paragraph_count = text.split(/\n\n/).length

# Make a list of words in the text that aren't stop words,
# count them, and work out the percentage of non-stop words
# against all words
all_words = text.scan(/\w+/)
good_words = all_words.select{ |word| !stopwords.include?(word) }
good_percentage = ((good_words.length.to_f / all_words.length.to_f) * 100).to_i

# Summarize the text by cherry picking some choice sentences
sentences = text.gsub(/\s+/, ' ').strip.split(/\.|\?|!/)
sentences_sorted = sentences.sort_by { |sentence| sentence.length }
one_third = sentences_sorted.length / 3
ideal_sentences = sentences_sorted.slice(one_third, one_third + 1)
ideal_sentences = ideal_sentences.select { |sentence| sentence =~ /is|are/ }

# Give the analysis back to the user
puts "#{line_count} lines"
puts "#{character_count} characters"
puts "#{character_count_nospaces} characters (excluding spaces)"
puts "#{word_count} words"
puts "#{sentence_count} sentences"
puts "#{paragraph_count} paragraphs"
puts "#{sentence_count / paragraph_count} sentences per paragraph (average)"
puts "#{word_count / sentence_count} words per sentence (average)"
puts "#{good_percentage}% of words are non-fluff words"
puts "Summary:\n\n" + ideal_sentences.join(". ")
puts "-- End of analysis"
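The summary step in the listing above is not covered in the notes, so here is a small sketch of how it appears to work: the sentences are sorted by length, a slice starting one third of the way in is kept (dropping the shortest and longest), and only sentences containing "is" or "are" survive. The helper name pick_summary and the sample sentences are made up for illustration.

```ruby
# Hypothetical helper wrapping the summary logic from the listing.
def pick_summary(sentences)
  sorted = sentences.sort_by { |sentence| sentence.length }
  one_third = sorted.length / 3
  # slice(start, length): start one third in, keep one_third + 1 sentences
  middle = sorted.slice(one_third, one_third + 1)
  middle.select { |sentence| sentence =~ /is|are/ }
end

sentences = [
  "A much longer sentence that rambles on and on",
  "Go",
  "Cats are nice pets",
  "Stop",
  "I like green tea",
  "Ruby is fun"
]
p pick_summary(sentences)  # ["Ruby is fun", "Cats are nice pets"]
```

Note that /is|are/ matches anywhere inside a word, so a sentence such as "This works" also passes the filter. If that is not wanted, a word-boundary pattern like /\b(?:is|are)\b/ is one possible alternative.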
