Some basic tasks of text analysis can be performed through the command line.
- Install Wget
- Place yourself in your working directory (e.g. /Users/susername/Documents/DHPracticum)
- Download any text, e.g.
wget http://archive.org/download/thestoriesmother05792gut/stmtn10.txt - Check the file is there:
ls - Get the information about that file:
file stmtn10.txt - Get the first few lines of the file:
head stmtn10.txt - Get the last few lines of the file:
tail stmtn10.txt - Copy the file to make a backup:
cp stmtn10.txt stmtn10-backup.txtand check the new file is there:ls. - Have a look to the whole file:
less -N stmtn10.txt - To look for a word:
/giant, to find the next occurencen, and to leave the fileq. - Remove the final lines of the documents:
sed '2206,2525d' stmtn10.txt > stmtn10-nofooter.txt - Remove lines from 1 to 40:
sed '1,40d' stmtn10-nofooter.txt > stmtn10-trimmed.txt - Check you have all the created files with
ls - Count the lines of the file:
wc -l stmtn10-trimmed.txt - Count the characters of the file:
wc -m stmtn10-trimmed.txt - Get the line number where a word appears:
grep -n "giant" stmtn10-trimmed.txt - Find pattern (-E) and get line numbers (-n) looking for lowercase and capital letter:
grep -E -n "(G|g)iant" stmtn10-trimmed.txt - Standarize the text by removing the punctuation of the file:
tr -d '[:punct:]' < stmtn10-trimmed.txt > stmtn10-nopunct.txt - And by removing capital letters:
tr '[:upper:]' '[:lower:]' < stmtn10-nopunct.txt > stmtn10-lowercase.txt - Normalize the carriage returns:
tr -d '\r' < stmtn10-lowercase.txt > stmtn10-lowercaself.txt - Transform each blank space into an end-of-line character:
tr ' ' '\n' < stmtn10-lowercaself.txt > stmtn10-oneword.txt - Sort words alphabetically:
sort stmtn10-oneword.txt > stmtn10-onewordsort.txt - Create a new file where the words are listed alphabetically, each preceded by its frequency:
uniq -c stmtn10-onewordsort.txt > stmtn10-wordfreq.txt - Check the beginning of the file:
head stmtn10-wordfreq.txt - Create a pipeline instead of several files:
tr ' ' '\n' < stmtn10-lowercaself.txt | sort | uniq -c > stmtn10-wordfreq2.txt
This commands were taken from: William J Turkel (2013). Basic Text Analysis with Command Line Tools in Linux