galilette |
2005-03-11 16:11 |
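The purpose of this post is to show, through a worked example, how *nix command-line tools can simplify some common chores. If you like, the code you end up needing is only 3 lines, namely query.awk: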
Quote: {if ($1==id) {print $2}}
and merge.awk:
Quote: { if ( ((query_command " id=" $1 " " score_file)|getline score) <=0){score=na_string} print $0, score }
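awk is a text-processing language on *nix, generally used for handling simple data files. The letters a, w, and k are the initials of the three people who invented the language (Aho, Weinberger, and Kernighan). The key concepts are records and fields: by default each line is a record, and within each line the fields are separated by whitespace (spaces, tabs, etc.).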
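To make records and fields concrete, here is a minimal sketch run from the shell. The sample lines and the shell variable names (ws_second, csv_second, field_count) are made up for illustration and are not part of the scripts in this post.

```shell
# Default field splitting: runs of spaces and tabs separate fields.
ws_second=$(printf '12345678\t100\n' | awk '{ print $2 }')

# Setting FS to a comma makes awk split comma-separated lines instead.
csv_second=$(printf '18790314,einstein\n' | awk 'BEGIN { FS = "," } { print $2 }')

# NF holds the number of fields in the current record.
field_count=$(printf 'a b   c\n' | awk '{ print NF }')

echo "$ws_second $csv_second $field_count"
```

The same two settings come up later: the tab-separated score file works with the default splitting, while the CSV file needs FS set to a comma.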
For an introduction to awk, see the awk examples at http://www-900.ibm.com/developerWorks/cn/linux/shell/index.shtml, for instance the first article: http://www-900.ibm.com/developerWorks/cn/linux/shell/awk/awk-1/index.shtml
Situation:
[*] A CSV (comma-separated values) file exported from an Excel file, recording student IDs and names. For example, record.csv:
Quote:
18851007,bohr
18790314,einstein
12345678,galilette
...
[*] A file recording student IDs and scores, separated by a tab. For example, score.dat:
Quote:
12345678	100
18790314	59
...
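Requirement: merge the scores into the record file.
Analysis of the difficulties:
[*] The records in score.dat and record.csv are in different orders. For example, record.csv is sorted by the stroke count of the surname, while score.dat is in the order the homework was graded.
[*] record.csv lists every student, but not every student hands in the homework. This time, for example, bohr did not.
Solution: create two awk scripts. query.awk looks up the score for a given student ID; merge.awk walks through record.csv in order and calls query.awk to get each score (using N/A when the homework was not handed in).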
# pseudo code for query.awk:
#   for each line in data file,
#     if field_1 is equal to given ID, then
#       return field_2
#   next

#! /bin/awk -f
# query.awk
## Notice: this script returns nothing if the ID NO is not found
# commandline provided variables: id
{ if ($1==id) { print $2} }
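To try the lookup on its own, the sketch below runs the same one-line program inline against a throwaway copy of the sample score data. The scratch directory and the shell variables (workdir, found, missing) are assumptions made for the demo; the awk program itself is the one from query.awk.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
printf '12345678\t100\n18790314\t59\n' > "$workdir/score.dat"

# An operand of the form var=value is treated by awk as an assignment that
# takes effect just before the next file operand is read.
found=$(awk '{ if ($1 == id) { print $2 } }' id=18790314 "$workdir/score.dat")

# An ID that is absent from the file produces no output at all.
missing=$(awk '{ if ($1 == id) { print $2 } }' id=18851007 "$workdir/score.dat")

echo "found=[$found] missing=[$missing]"
rm -r "$workdir"
```

The empty result for a missing ID is what merge.awk has to detect and turn into N/A.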
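# pseudo code for merge.awk:
#   for each line in record,
#     call query.awk with id=field_1
#     if returned value is blank, then
#       set score = N/A
#     else
#       set score = value returned by query
#     end if
#     append score to current line, and print to stdout
#   next

#! /bin/awk -f
# merge.awk:
BEGIN {
    query_command = "./query.awk"
    score_file = "./score.dat"
    na_string = "N/A"
    # reset FieldSeparator
    FS = ","
    # reset OutputFieldSeparator
    OFS = ","
}
{
    # build the lookup command once so the same string can be passed to close()
    cmd = query_command " id=" $1 " " score_file
    # getline returns 0 (no output) or -1 (error) when no score was read
    if ((cmd | getline score) <= 0) {
        score = na_string
    }
    # close the pipe so each record does not leave a command open
    close(cmd)
    print $0, score
}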
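The `<= 0` test in merge.awk relies on what `command | getline var` returns. Here is a small sketch of that behaviour, using `echo 59` as a stand-in for the query command (the stand-in and the names r1, r2, getline_demo are illustrative only):

```shell
getline_demo=$(awk 'BEGIN {
    cmd = "echo 59"
    r1 = (cmd | getline score)   # a line was read: returns 1, score is "59"
    r2 = (cmd | getline score)   # output exhausted: returns 0, score is unchanged
    close(cmd)                   # release the pipe when done with it
    print r1, r2, score
}')
echo "$getline_demo"
```

Because score keeps its old value when nothing is read, merge.awk has to overwrite it with N/A explicitly rather than rely on it being empty.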
Final run:
Quote:
galilette@socrate:~$ ls
merge.awk query.awk record.csv score.dat
galilette@socrate:~$ ./merge.awk record.csv
18851007,bohr,N/A
18790314,einstein,59
12345678,galilette,100
...
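For larger files, starting one query process per record may get slow. As a possible alternative, and not the method used in this post, the same merge can be sketched in a single awk pass with an associative array and the common NR == FNR idiom. The scratch files and the names workdir, merged, and s are assumptions for the demo, and the idiom assumes score.dat is not empty.

```shell
workdir=$(mktemp -d)
printf '18851007,bohr\n18790314,einstein\n12345678,galilette\n' > "$workdir/record.csv"
printf '12345678\t100\n18790314\t59\n' > "$workdir/score.dat"

merged=$(awk '
    # First file (score.dat): default whitespace splitting; remember each score by ID.
    NR == FNR { score[$1] = $2; next }
    # Second file (record.csv): look up the ID, falling back to N/A.
    {
        s = ($1 in score) ? score[$1] : "N/A"
        print $0, s
    }
' "$workdir/score.dat" FS=, OFS=, "$workdir/record.csv")

echo "$merged"
rm -r "$workdir"
```

The FS=, and OFS=, operands sit between the two file names, so they only take effect once awk starts reading record.csv; score.dat is still split on whitespace.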