Some basic tasks of text analysis can be performed through the command line.
- Install Wget
- Move into your working directory (e.g. /Users/username/Documents/DHPracticum)
- Download any text, e.g.
wget http://archive.org/download/thestoriesmother05792gut/stmtn10.txt
- Check the file is there:
ls
- Get the information about that file:
file stmtn10.txt
- Get the first few lines of the file:
head stmtn10.txt
- Get the last few lines of the file:
tail stmtn10.txt
- Copy the file to make a backup:
cp stmtn10.txt stmtn10-backup.txt
and check the new file is there:
ls
- Have a look at the whole file:
less -N stmtn10.txt
- To look for a word while viewing the file in less:
/giant
- To find the next occurrence:
n
- To leave the file:
q
- Remove the final lines of the document:
sed '2206,2525d' stmtn10.txt > stmtn10-nofooter.txt
- Remove lines from 1 to 40:
sed '1,40d' stmtn10-nofooter.txt > stmtn10-trimmed.txt
- Check that all the created files are there:
ls
- Count the lines of the file:
wc -l stmtn10-trimmed.txt
- Count the characters of the file:
wc -m stmtn10-trimmed.txt
- Get the line number where a word appears:
grep -n "giant" stmtn10-trimmed.txt
- Use an extended regular expression (-E) to match the word with either a lowercase or a capital initial, and print line numbers (-n):
grep -E -n "(G|g)iant" stmtn10-trimmed.txt
- Standardize the text by removing the punctuation:
tr -d '[:punct:]' < stmtn10-trimmed.txt > stmtn10-nopunct.txt
- And by removing capital letters:
tr '[:upper:]' '[:lower:]' < stmtn10-nopunct.txt > stmtn10-lowercase.txt
- Normalize the line endings by removing carriage returns:
tr -d '\r' < stmtn10-lowercase.txt > stmtn10-lowercaself.txt
- Transform each blank space into an end-of-line character:
tr ' ' '\n' < stmtn10-lowercaself.txt > stmtn10-oneword.txt
- Sort words alphabetically:
sort stmtn10-oneword.txt > stmtn10-onewordsort.txt
- Create a new file where the words are listed alphabetically, each preceded by its frequency:
uniq -c stmtn10-onewordsort.txt > stmtn10-wordfreq.txt
- Check the beginning of the file:
head stmtn10-wordfreq.txt
- Chain the last steps in a pipeline instead of creating several intermediate files:
tr ' ' '\n' < stmtn10-lowercaself.txt | sort | uniq -c > stmtn10-wordfreq2.txt
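As a possible extension of the steps above, the whole cleaning and counting chain can be sketched as one pipeline that also ranks the words by frequency. This is only a sketch under a few assumptions: the function name word_freq and the file sample.txt are made up for illustration, the sample text stands in for the downloaded book, and the final sort by count is an addition that is not among the steps listed above.

```shell
# word_freq is a hypothetical helper, not a standard command.
# $1: path to a plain-text file.
word_freq() {
  tr -d '\r' < "$1" |              # remove carriage returns
    tr -d '[:punct:]' |            # remove punctuation
    tr '[:upper:]' '[:lower:]' |   # convert to lowercase
    tr -s '[:space:]' '\n' |       # one word per line, runs of whitespace squeezed
    sed '/^$/d' |                  # drop any empty line left over
    sort |                         # group identical words together
    uniq -c |                      # prefix each word with its count
    sort -k1,1nr -k2               # highest count first, ties in alphabetical order
}

# Small made-up sample so the sketch runs without downloading anything.
printf 'The giant saw the Giant.\r\nThe end.\r\n' > sample.txt
word_freq sample.txt
```

With the book itself, the call would be word_freq stmtn10-trimmed.txt, optionally followed by | head to see only the most frequent words. Squeezing whitespace with tr -s avoids the empty "words" that appear when every single blank space becomes its own line.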
These commands were taken from: William J. Turkel (2013), Basic Text Analysis with Command Line Tools in Linux.
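The search steps above can also be varied with a few other common grep options. The sketch below is not taken from the cited tutorial; the file grep-sample.txt and its contents are made up so that the commands run on their own. Note that -i matches any mix of upper and lower case (including GIANT), which is broader than the "(G|g)iant" pattern, and that -o is widely available in GNU and BSD grep but is not required by POSIX.

```shell
# Made-up sample text so the sketch runs on its own.
printf 'A giant came.\nThe Giant met another giant, a GIANT.\nNo one else.\n' > grep-sample.txt

# -i ignores case; -n adds line numbers.
grep -i -n "giant" grep-sample.txt

# -c prints the number of matching lines (not the number of matches).
grep -i -c "giant" grep-sample.txt

# -o prints each match on its own line, so wc -l counts every occurrence.
grep -i -o "giant" grep-sample.txt | wc -l
```

On the book itself, the same commands would take stmtn10-trimmed.txt as the file argument. The line count and the occurrence count differ whenever a word appears more than once on the same line.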