This post uses a worked example to show how *nix command-line tools can simplify some common chores. If you like, the code you end up needing is only 3 lines:
query.awk:
Quote:
{if ($1==id) {print $2}}
and merge.awk:
Quote:
{
if ( ((query_command " id=" $1 " " score_file)|getline score) <=0){score=na_string}
print $0, score
}
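awk is a text-processing language on *nix, generally used for handling simple data files. The letters a, w and k are the surname initials of the three people who invented it (Aho, Weinberger and Kernighan). The key concepts are records and fields: by default each line is one record, and the fields within a line are separated by whitespace (spaces, tabs, etc.).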
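To see records and fields in action, here is a small one-liner. It is only an illustration and not part of the solution; the two input lines are made up for the purpose. NR is awk's built-in record counter and NF is the number of fields in the current record.

```shell
# Two records: the first separates its fields with a tab, the second with spaces.
# With the default field separator, both are split into two fields.
fields=$(printf '12345678\t100\n18790314   59\n' | awk '{ print NR, NF, $1, $2 }')
echo "$fields"
```

This prints `1 2 12345678 100` followed by `2 2 18790314 59`: two records with two fields each, regardless of which kind of whitespace separates them.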
For an introduction to awk, see the awk examples at
http://www-900.ibm.com/developerWorks/cn/linux/shell/index.shtml, for instance the first article:
http://www-900.ibm.com/developerWorks/cn/linux/shell/awk/awk-1/index.shtml
Situation:
[*]A csv (comma separated values) file exported from an Excel file, holding student IDs and names. For example:
record.csv
Quote:
18851007,bohr
18790314,einstein
12345678,galilette
...
[*]A file holding student IDs and scores, separated by a tab. For example:
score.dat
Quote:
12345678 100
18790314 59
...
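Requirement: merge the scores into the record file
Difficulties:
[*]The records in score.dat and record.csv are in different orders; for example, record is sorted by the stroke count of the surname, while score.dat is in the order the homework was graded
[*]record lists every student, but not every student hands in the homework. This time, for example, bohr did not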
Solution:
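Create two awk scripts: query.awk looks up the score of the student with a given ID, and merge.awk calls query for each line of record, in record's order, to get the score (N/A stands in when the homework was not handed in)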
# pseudo code for query.awk:
for each line in data file,
if field_1 is equal to the given ID, then return field_2
next
#! /bin/awk -f
# query.awk
## Notice: this script returns nothing if the ID NO is not found
# commandline provided variables: id
{
    if ($1 == id) {
        print $2
    }
}
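query.awk can be tried on its own before it is wired into merge.awk. The sketch below recreates the script and a two-line score file in a scratch directory (the directory and the sample lines are assumptions made for the demo). It runs the script through `awk -f` instead of the `#! /bin/awk` line, since awk may live at a different path on your system.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

# Recreate the one-rule query script and a small tab-separated score file.
printf '%s\n' '{ if ($1 == id) { print $2 } }' > query.awk
printf '12345678\t100\n18790314\t59\n' > score.dat

# id=... is an awk command-line assignment; it takes effect before score.dat is read.
found=$(awk -f query.awk id=18790314 score.dat)
missing=$(awk -f query.awk id=18851007 score.dat)

echo "found: [$found]"
echo "missing: [$missing]"
```

The first lookup yields 59. The second yields an empty string, which is exactly the case merge.awk has to turn into N/A.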
# pseudo code for merge.awk:
for each line in record,
call query.awk with id=field_1
if returned value is blank,
then set score = N/A
else
set score = value returned by query
end if
append score to current line, and print to stdout
next
#! /bin/awk -f
# merge.awk:
BEGIN {
query_command = "./query.awk"
score_file="./score.dat"
na_string="N/A"
#reset FieldSeparator
FS=","
#reset OutputFieldSeparator
OFS=","
}
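{
    cmd = query_command " id=" $1 " " score_file
    # getline returns 0 when query.awk prints nothing, i.e. the ID is not in score_file
    if ((cmd | getline score) <= 0) {
        score = na_string
    }
    # close the pipe so that a long record file does not exhaust open file descriptors
    close(cmd)
    print $0, score
}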
Final run:
Quote:
galilette@socrate:~$ ls
merge.awk query.awk record.csv score.dat
galilette@socrate:~$ ./merge.awk record.csv
18851007,bohr,N/A
18790314,einstein,59
12345678,galilette,100
...
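Calling query.awk once per student rescans score.dat for every line of record.csv. For larger files, a common alternative (a different technique from the one in this post) is to load the scores into an awk associative array first and then look each ID up in memory. Here is a minimal sketch under the same file layout; the array name `score`, the scratch directory and the sample data are choices made for the illustration.

```shell
# Work in a scratch directory with the sample data from the post.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '18851007,bohr\n18790314,einstein\n12345678,galilette\n' > record.csv
printf '12345678\t100\n18790314\t59\n' > score.dat

# score.dat is read first with the default whitespace separator: remember each score by ID.
# FS=, and OFS=, are command-line assignments that take effect just before record.csv is read.
merged=$(awk '
    NR == FNR { score[$1] = $2; next }
    { print $0, (($1 in score) ? score[$1] : "N/A") }
' score.dat FS=, OFS=, record.csv)
echo "$merged"
```

For the sample data this gives the same three lines as the run above, with each file read only once.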