Tech Talk: Unix is Friend - Text Processing

Instead of relying on a single application tied to a proprietary file format, Unix utilities work on text streams. A text stream is not only a text file but also the input and output of a command. Unix ships with several handy text processing utilities that cooperate well over such streams, so it pays to save your documents in plain text formats whenever possible. Here we briefly introduce some of them.

Before we jump into these utilities, let's look at regular expressions. Regular expressions are not standalone command-line utilities but a mini-language embedded in many Unix utilities and programming languages. They are compact search patterns for strings, and they save a lot of hand-written conditions and loops. You can still use these Unix utilities without knowing regular expressions, but the utilities become more powerful with them.

Different Unix utilities use different dialects of regular expressions, which causes confusion. We suggest starting with the regular expressions of Perl, one of the most feature-rich dialects. Check perlrequick, perlretut, and perlre for more information. A simple way to practice Perl regular expressions on the command line is pcregrep, a utility bundled with the Perl Compatible Regular Expressions (PCRE) library; in PCRE2 the same tool is named pcre2grep.
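To get a feel for what a regular expression buys you, here is a minimal sketch that picks out lines containing a date in `YYYY-MM-DD` form. It uses `grep -E` (POSIX extended regular expressions) rather than Perl syntax, only because `grep` is available almost everywhere; the sample lines are made up for illustration.

```shell
# Sample input, made up for this illustration.
sample='released 2024-01-15
no date here
patched 2024-02-03'

# -E selects extended regular expressions.
# [0-9]{4} means "exactly four digits"; the hyphens match literally.
dates=$(printf '%s\n' "$sample" | grep -E '[0-9]{4}-[0-9]{2}-[0-9]{2}')
printf '%s\n' "$dates"
```

Writing the same filter by hand would take a loop over the characters of each line and several conditions; the pattern states it in one line.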

Now, let's go back to the text processing utilities. They are practically part of Unix, so installation is seldom needed. We won't cover every text processing utility here, only some common ones. They are:

- iconv
- sed
- tr
- awk
- split and csplit
- head and tail
- nl
- wc
- perl

Using `perl` to mimic Unix utilities is another application of Perl. The advantage is that you don't need to memorize the usage of many utilities, although the equivalent `perl` command is usually longer. See perlrun for details. There are also books on Perl one-liners, such as *Minimal Perl for Unix and Linux People* (Manning) and *Perl One-Liners* (No Starch Press).

`iconv` converts text from one character encoding to another. For example:


```console
$ iconv -f iso-8859-1 -t utf-8 < infile > outfile
```
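To see what the conversion does at the byte level, the following sketch feeds `iconv` a four-byte ISO-8859-1 string and dumps the result in hexadecimal with `od`. The input string is made up for illustration, and the sketch assumes an `iconv` that knows both encodings by these names.

```shell
# 'caf\351' is "café" in ISO-8859-1: the single byte 0xE9 encodes "é".
# After conversion, "é" becomes the two UTF-8 bytes 0xC3 0xA9.
hex=$(printf 'caf\351' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')
printf '%s\n' "$hex"   # prints 636166c3a9
```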

Perl comes with a utility called `piconv`, which behaves like `iconv`.  It is handy on systems that have no `iconv`, such as Microsoft Windows.

`sed` modifies text streams and prints the result.  `sed` can be used with or without regular expressions.  Normally, `sed` doesn't alter your file but prints the result to standard output.  A simple use of `sed` that edits files in place instead looks like this:

```console
$ sed -i.bak 's/pattern/text/' file01 file02 file03 ...
```

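In this case, `-i` means in-place editing; each original file is saved with a *.bak* suffix, as *file01.bak*, etc.  Note that `-i` is an extension found in GNU and BSD `sed`, not part of POSIX.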
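For comparison, here is a small sketch of `sed` used as a plain filter, without `-i`, so that nothing on disk changes. The input strings are made up for illustration.

```shell
# Without -i, sed leaves its input alone and writes the result to standard output.
# The trailing "g" replaces every match on a line, not only the first one.
all=$(printf 'foo bar foo\n' | sed 's/foo/baz/g')
first=$(printf 'foo bar foo\n' | sed 's/foo/baz/')

# The pattern is a regular expression: [0-9][0-9]* matches a run of digits.
masked=$(printf 'order 4711 shipped\n' | sed 's/[0-9][0-9]*/N/')

printf '%s\n' "$all" "$first" "$masked"
# prints:
# baz bar baz
# baz bar foo
# order N shipped
```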

You may use `perl` to mimic `sed`:

```console
$ perl -p -i.bak -e 's/pattern/text/;' file01 file02 file03 ...
```

`tr` translates text character by character.  `tr` doesn't support regular expressions.  To use `tr` to convert uppercase letters to lowercase letters, do this:

```console
$ tr "[:upper:]" "[:lower:]" < file
```

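If you want to list all words in a file, use `tr` to replace every character other than letters and digits with a newline.  The `-c` option complements the set, and `-s` squeezes each run of replaced characters into a single newline so that no blank lines appear:

```console
$ tr -cs "[:alnum:]" "\n" < file
```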
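Because `tr` reads and writes text streams, it chains naturally with other utilities. The sketch below combines the two `tr` uses above with `sort` and `uniq` to find the most frequent word; the sample sentence is made up for illustration.

```shell
# Sample text, made up for this illustration.
text='The cat saw the dog. The dog ran.'

# 1. tr -cs: turn every run of non-alphanumeric characters into one newline.
# 2. tr: fold uppercase to lowercase so "The" and "the" count as one word.
# 3. sort | uniq -c: group identical words and count each group.
# 4. sort -rn | head -n 1: keep the line with the highest count.
top=$(printf '%s\n' "$text" |
  tr -cs '[:alnum:]' '\n' |
  tr '[:upper:]' '[:lower:]' |
  sort | uniq -c | sort -rn | head -n 1)
printf '%s\n' "$top"
```

The result is the count followed by the word, here 3 and `the`; the amount of leading whitespace printed by `uniq -c` varies between systems.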

Again, you can substitute `perl` for `tr`:

```console
$ perl -pe 'tr/A-Z/a-z/;' file
```

AWK is an interpreted programming language for data extraction and report generation.  AWK is well suited to quick one-liner text processing.  To list all users on a system with AWK, do this:

```console
$ awk -F':' '/^[^#]/ { print $1 }' /etc/passwd
```

Many features of AWK have been absorbed into Perl.  To use `perl` to mimic `awk` for the same task, do this:

```console
$ perl -a -F':' -nle 'next if /^#/; print $F[0];' /etc/passwd
```
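The report generation side of AWK shows up once you aggregate over fields. As a sketch, the following counts how many accounts use each login shell. It runs on three passwd-style records made up for illustration rather than on the real `/etc/passwd`, so the result is predictable.

```shell
# Three passwd-style records, made up for this illustration.
records='root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
alice:x:1000:1000:Alice:/home/alice:/bin/bash'

# Field 7 is the login shell. The array "count" is keyed by shell path,
# and the END block runs once after the last input line.
report=$(printf '%s\n' "$records" |
  awk -F: '{ count[$7]++ } END { for (s in count) print count[s], s }' |
  sort -rn)
printf '%s\n' "$report"
# prints:
# 2 /bin/bash
# 1 /usr/sbin/nologin
```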

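`split` and `csplit` split one file into several files: `split` cuts by line count or size, while `csplit` cuts by context, that is, at lines matching a regular expression or at given line numbers.  Since both utilities write multiple output files, there is no easy way to mimic `csplit` with a `perl` one-liner.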

To use `csplit` to split a file, do this:

```console
$ csplit file /pattern/
```
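The difference between the two utilities is easier to see side by side. The following sketch runs both on a five-line file created in a temporary directory; the file name, its content, and the prefixes `part_` and `chunk_` are made up for illustration.

```shell
# Work in a throwaway directory so the generated pieces are easy to clean up.
dir=$(mktemp -d)
printf 'intro\n# one\nbody one\n# two\nbody two\n' > "$dir/notes.txt"

# split: fixed-size pieces, here 2 lines each, named part_aa, part_ab, ...
split -l 2 "$dir/notes.txt" "$dir/part_"
split_count=$(ls "$dir"/part_* | wc -l | tr -d ' ')

# csplit: cut at the first line matching /^# two/.
# -s suppresses the size report, -f sets the output prefix (chunk_00, chunk_01).
csplit -s -f "$dir/chunk_" "$dir/notes.txt" '/^# two/'
second_chunk=$(cat "$dir/chunk_01")

printf '%s pieces from split\n' "$split_count"
printf '%s\n' "$second_chunk"

rm -rf "$dir"
```

Here `split` produces three pieces regardless of content, while `csplit` produces two, with the second one starting at the `# two` line.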

`head` prints the first several lines of a file.  Similarly, `tail` prints the last several lines of a file.  When run without a line-count option, `head` and `tail` print 10 lines.  To print the first 5 lines of a file, do this:

```console
$ head -n5 file
```

It is also possible to mimic `head` and `tail` in `perl`, but the command is longer.  Here `$.` holds the current line number, which starts at 1:

```console
$ perl -ne 'print if $. <= 5;' file
```
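`head` and `tail` also combine into a simple way to pull a range of lines out of the middle of a stream. A minimal sketch, using ten generated lines as input:

```shell
# Ten numbered lines, generated for this illustration.
lines=$(awk 'BEGIN { for (i = 1; i <= 10; i++) print "line " i }')

# The last two lines of the input.
last_two=$(printf '%s\n' "$lines" | tail -n 2)

# Lines 4-5: keep the first 5 lines, then the last 2 of those.
middle=$(printf '%s\n' "$lines" | head -n 5 | tail -n 2)

printf '%s\n' "$last_two" "$middle"
# prints:
# line 9
# line 10
# line 4
# line 5
```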

`nl` numbers the lines of a file and prints each line number followed by the content of that line.  It is convenient when you need line numbers.
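As a small sketch of the options involved, the following numbers three lines of made-up input, one of them blank. The choice of `": "` as separator is only for readability.

```shell
# -ba numbers every line, including blank ones (by default nl skips them).
# -nln left-justifies the numbers; -w1 shrinks the number column to its minimum width.
# -s': ' puts ": " between the number and the text.
numbered=$(printf 'alpha\n\nbeta\n' | nl -ba -nln -w1 -s': ')
printf '%s\n' "$numbered"
```

With `-ba`, the blank second line gets the number 2, so `beta` is reported as line 3, which matches the line numbers that other tools such as `csplit` expect.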

Here is a longer example combining `csplit`, `nl` and `perl`.  We extract the line numbers of the titles in the file and split the file at those line numbers.

```console
$ csplit file $(nl -ba -nln file | perl -a -nle 'print "$F[0]" if /pattern/;' \
| perl -ne ' chomp; push @a, $_; } END { print "@a";')
```

`wc` is convenient for basic statistics of files, such as character counts, word counts and line counts.  Be aware of Unicode issues: files may contain multibyte characters, so the byte count and the character count can differ.  An alternative program is `uniwc`, part of the **Unicode::Tussle** Perl module.
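The multibyte issue can be sketched with a short UTF-8 string. The sample text is made up for illustration; the result of `wc -m` depends on the current locale, so no fixed value is claimed for it here.

```shell
# "naïve café" in UTF-8: 10 characters, but "ï" and "é" take two bytes each.
sample=$(printf 'na\303\257ve caf\303\251')

# tr -d ' ' strips the padding that some wc implementations print.
lines=$(printf '%s\n' "$sample" | wc -l | tr -d ' ')
words=$(printf '%s\n' "$sample" | wc -w | tr -d ' ')
bytes=$(printf '%s\n' "$sample" | wc -c | tr -d ' ')

# wc -m counts characters instead of bytes, but only in a locale that
# understands UTF-8; in the C locale it typically reports the byte count.
chars=$(printf '%s\n' "$sample" | wc -m | tr -d ' ')

printf 'lines=%s words=%s bytes=%s chars=%s\n' "$lines" "$words" "$bytes" "$chars"
```

`wc -c` reports 13 bytes here (12 for the text plus the newline), while a reader would count 11 characters including the newline.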

There are more commands, and more ways to use them, among the text processing utilities of Unix, but I won't dig deeper here.  Consult the system manual or online resources if you are interested in this topic.  Good luck.

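About the Author

ByteBard (位元詩人) holds a master's degree in an information-related field and likes to solve all kinds of problems with open-source technology. Such technology is cross-platform, highly reusable, and long-lived.

Besides open-source technology, ByteBard likes Japanese food and black coffee, knows some Japanese, and sometimes travels independently.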