
Some basic tasks of text analysis can be performed through the command line.

  • Install Wget
  • Change to your working directory (e.g. cd /Users/susername/Documents/DHPracticum)
  • Download any text, e.g. wget http://archive.org/download/thestoriesmother05792gut/stmtn10.txt
  • Check that the file is there: ls
  • Get information about the file: file stmtn10.txt
  • Get the first few lines of the file: head stmtn10.txt
  • Get the last few lines of the file: tail stmtn10.txt
  • Copy the file to make a backup: cp stmtn10.txt stmtn10-backup.txt, and check that the new file is there: ls
  • Have a look at the whole file: less -N stmtn10.txt (the -N option displays line numbers)
  • Inside less, look for a word by typing /giant, find the next occurrence with n, and leave the file with q
  • Remove the final lines of the document (the footer): sed '2206,2525d' stmtn10.txt > stmtn10-nofooter.txt
  • Remove lines 1 to 40 (the header): sed '1,40d' stmtn10-nofooter.txt > stmtn10-trimmed.txt
  • Check that you have all the created files: ls
  • Count the lines of the file: wc -l stmtn10-trimmed.txt
  • Count the characters of the file: wc -m stmtn10-trimmed.txt
  • Get the line numbers where a word appears: grep -n "giant" stmtn10-trimmed.txt
  • Use an extended regular expression (-E) and line numbers (-n) to match both the lowercase and the capitalized form: grep -E -n "(G|g)iant" stmtn10-trimmed.txt
  • Standardize the text by removing punctuation from the file: tr -d '[:punct:]' < stmtn10-trimmed.txt > stmtn10-nopunct.txt
  • And by converting capital letters to lowercase: tr '[:upper:]' '[:lower:]' < stmtn10-nopunct.txt > stmtn10-lowercase.txt
  • Remove the carriage returns (normalize Windows-style line endings to Unix): tr -d '\r' < stmtn10-lowercase.txt > stmtn10-lowercaself.txt
  • Transform each blank space into an end-of-line character, so that there is one word per line: tr ' ' '\n' < stmtn10-lowercaself.txt > stmtn10-oneword.txt
  • Sort words alphabetically: sort stmtn10-oneword.txt > stmtn10-onewordsort.txt
  • Create a new file where the words are listed alphabetically, each preceded by its frequency: uniq -c stmtn10-onewordsort.txt > stmtn10-wordfreq.txt (uniq -c counts repeated adjacent lines, which is why the words were sorted first)
  • Check the beginning of the file: head stmtn10-wordfreq.txt
  • Or do the last steps in a single pipeline instead of creating several intermediate files: tr ' ' '\n' < stmtn10-lowercaself.txt | sort | uniq -c > stmtn10-wordfreq2.txt (the whole workflow is collected into a single script after this list)
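
The whole workflow can also be collected into one reusable script. The following is a minimal sketch that simply chains the commands from the list above; it assumes wget is installed and that the sed line ranges (1 to 40 for the header, 2206 to 2525 for the footer) still match this particular copy of stmtn10.txt, so they would need to be adjusted for any other text.

    #!/bin/sh
    # Sketch: the full workflow from the list above, in one script.

    # Download the text
    wget http://archive.org/download/thestoriesmother05792gut/stmtn10.txt

    # Strip the footer and the header (line ranges are specific to this file)
    sed '2206,2525d' stmtn10.txt > stmtn10-nofooter.txt
    sed '1,40d' stmtn10-nofooter.txt > stmtn10-trimmed.txt

    # Normalize: remove punctuation, lowercase everything, drop carriage returns
    tr -d '[:punct:]' < stmtn10-trimmed.txt > stmtn10-nopunct.txt
    tr '[:upper:]' '[:lower:]' < stmtn10-nopunct.txt > stmtn10-lowercase.txt
    tr -d '\r' < stmtn10-lowercase.txt > stmtn10-lowercaself.txt

    # One word per line, sort alphabetically, count word frequencies
    tr ' ' '\n' < stmtn10-lowercaself.txt | sort | uniq -c > stmtn10-wordfreq.txt

    # Peek at the result
    head stmtn10-wordfreq.txt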
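
The frequency file lists words alphabetically. To see the most frequent words instead, the same sort and head commands can be combined with numeric, reverse ordering; this extra step is not part of the original list, but it only uses standard sort options (-n for numeric sort, -r for reverse order):

    # Show the ten most frequent words, highest counts first
    sort -n -r stmtn10-wordfreq.txt | head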

These commands were taken from: William J. Turkel (2013), Basic Text Analysis with Command Line Tools in Linux.


Copyright © Susanna Allés Torrent