Advanced ‘grep’: Search for Multiple Strings Simultaneously

Posted in Programming on December 14th, 2010 by Carl Zulauf

It took me a while to figure out how to do this so I thought I’d post a quick guide for other people who might need to have this trick up their sleeve.

By default, the unix/linux grep command uses POSIX basic regular expressions; the -E option switches it to POSIX extended regular expressions. I learned from this guide that in grep's basic syntax, unlike most other regular expression implementations, meta-characters such as parentheses and the pipe must be escaped with a backslash to be treated as meta-characters (the \| alternation is a GNU extension). If they are not escaped, grep treats them as literal characters. With -E the behavior flips to what most regex engines do: those characters are special on their own, and you escape them when you want them taken literally.
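A quick way to see the difference between the two syntaxes is to run the same pattern with and without -E. This is only a sketch: the two sample lines and the variable names are made up for the demonstration.

```shell
# Two sample lines: one with literal parentheses, one without.
sample=$(printf 'a(b)\nab\n')

# Basic syntax (no -E): unescaped ( and ) are ordinary characters.
bre_matches=$(printf '%s\n' "$sample" | grep '(b)')

# Extended syntax (-E): unescaped ( and ) form a group, so any line with "b" matches.
ere_matches=$(printf '%s\n' "$sample" | grep -E '(b)')

# Extended syntax with backslashes: \( and \) are literal parentheses again.
ere_literal=$(printf '%s\n' "$sample" | grep -E '\(b\)')

printf 'basic (b):\n%s\n\n' "$bre_matches"
printf 'extended (b):\n%s\n\n' "$ere_matches"
printf 'extended \\(b\\):\n%s\n' "$ere_literal"
```

The patterns are single-quoted so the shell passes them to grep untouched, which keeps the comparison about grep and not about shell quoting.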

I’m not particularly great at regex and I didn’t find a lot of guides on using extended POSIX regular expressions with grep. This really surprised me.

In my scenario I had a really large file I needed to search through. I had several unique strings I thought might be in this file, and due to its size I didn’t want to execute a separate grep command for each string. That would be woefully inefficient, since the entire file would be read from start to finish once per string. It would also take much longer to complete and would be a lot more work for me.

Instead, I knew there had to be a way to craft a single command that would search the file once looking for any lines containing any of the strings I supplied. Well, using my mediocre knowledge of regular expressions and my new-found understanding of the behavior of grep’s regex engine I came up with the following command:

grep -nE \(string1\|string2\|string3\|string4\) theFile.txt

Breaking down the command further:

  • grep – Indicates we are using the unix/linux grep utility.
  • -nE – These are the grep “options.” ‘n’ tells grep to tell us which line number any matches came from when it spits out its results. ‘E’ tells grep to use its extended regular expression engine (required for this kind of search).
  • \( and \) – These are the grouping parentheses for the regular expression. Technically not required here, but I find it good practice to group in case you ever need to back reference. The backslashes are there for the shell, not for grep: an unquoted ( or ) means something special to the shell, so each one is escaped and the shell hands grep a plain ( or ). With -E, grep treats those plain parentheses as special regex characters (group boundaries in this case). Wrapping the whole pattern in single quotes, '(string1|string2|string3|string4)', does the same job without any backslashes.
  • \| – The regex “OR” operator. This tells grep that it can match either the value to the left or the value to the right of this operator. If you read out the statement above as “or” the command makes perfect sense. We are telling grep: find string1 OR string2 OR string3 OR string4 in theFile.txt. Like the parentheses, the backslash only keeps the shell from treating | as a pipe; grep receives a bare |, which -E interprets as the OR operator and not just another character.
  • string1, string2, string3, and string4 – The strings grep should be looking for in the file. The line “I like the taste of string1’s bread” would match string1 and would be in the results printed by grep. The line “I think String2 has a really mean demeanor” would not match since this search is case sensitive by default.
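
To try the search without a huge file on hand, here is a small sketch. The temporary file and its four lines are made up and stand in for theFile.txt. It runs the search three ways: with a quoted pattern, with the backslash-escaped pattern, and with grep's -F option, which is another route when the strings are plain text and not regular expressions.

```shell
# Hypothetical sample file standing in for theFile.txt.
demo_file=$(mktemp)
printf '%s\n' \
  "I like the taste of string1's bread" \
  "I think String2 has a really mean demeanor" \
  "nothing to see here" \
  "string4 shows up on this line" > "$demo_file"

# Quoted pattern: the shell passes it through untouched, so no backslashes are needed.
quoted=$(grep -nE '(string1|string2|string3|string4)' "$demo_file")

# Same pattern with unquoted, backslash-escaped shell metacharacters.
escaped=$(grep -nE \(string1\|string2\|string3\|string4\) "$demo_file")

# Literal strings only: -F turns off regex syntax, one -e per string.
fixed=$(grep -nF -e string1 -e string2 -e string3 -e string4 "$demo_file")

printf '%s\n' "$quoted"
rm -f "$demo_file"
```

All three commands read the file once and report lines 1 and 4. Line 2 is skipped because of the capital S; adding -i to the options would make the search case insensitive.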

Printing a Specific Line From a Large File in Linux

Posted in Programming on December 4th, 2009 by Carl Zulauf

I recently had to find a specific line in a large (28GB) file equipped with nothing more than the line number. I thought it would take me just a few seconds to find a cool *nix utility to accomplish this task. Instead, it took me a bit of scouring to find something that works, and works well on large files. That’s OK though since I had to wait for the 28GB file to uncompress from a tarball… which, obviously, takes a while.

What I learned about while I waited was the *nix command ‘sed’. This is a tool built for command line processing of data files. Apparently it was birthed as an evolution of our trusty friend ‘grep’. The forum post I found which hinted that ‘sed’ was my solution didn’t provide much real information, and the Wikipedia article was mostly background with examples that didn’t help me.

Where I found the most useful info was the sed page on sourceforge… go figure. The docs page pointed me to ‘The sed one-liners‘ by Eric Pement. Here I found, through example, the power of ‘sed’ and an example that is more efficient on large files than the ones I found elsewhere.

So here is how you do it:

sed '34005050q;d' filename

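‘34005050’ is the line number. ‘d’ tells sed to delete each line it reads instead of printing it. ‘q’, which only applies at line 34005050, tells sed to print that line and quit before ‘d’ can discard it. Because sed stops reading there, it never touches the rest of the file, which is what makes this efficient on large files. ‘filename’ is of course the file you are trying to coax a specific line out of. To do an inclusive range of lines all at once (lines 8 through 12, for example), do this: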

sed '8,12!d' filename
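
Both commands are easy to check on a small file before pointing them at a 28GB one. In this sketch the 100-line file built with seq is made up for the demonstration. The last command is a variant I am assuming is worth having for big files, since '8,12!d' keeps reading to the end of the file after the range has been printed.

```shell
# Hypothetical 100-line sample file; each line holds its own line number.
demo_file=$(mktemp)
seq 1 100 > "$demo_file"

# Print only line 42: d discards lines 1-41, q prints line 42 and stops reading.
single=$(sed '42q;d' "$demo_file")

# Print lines 8 through 12: every line outside the range is deleted.
range=$(sed '8,12!d' "$demo_file")

# Variant for big files: -n silences default output, p prints the range, q quits at line 12.
range_quit=$(sed -n '8,12p;12q' "$demo_file")

printf 'line 42: %s\n' "$single"
printf 'lines 8-12:\n%s\n' "$range"
rm -f "$demo_file"
```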

I’m still learning about ‘sed’ but it’s already saving my ass. Have fun.
