Thread: AWK help...

  1. #1
    cat /dev/null streetster's Avatar
    Join Date
    Jul 2003
    Location
    London
    Posts
    4,138
    Thanks
    119
    Thanked
    100 times in 82 posts
    • streetster's system
      • Motherboard:
      • Asus P7P55D-E
      • CPU:
      • Intel i5 750 2.67 @ 4.0Ghz
      • Memory:
      • 4GB Corsair XMS DDR3
      • Storage:
      • 2x1TB Drives [RAID0]
      • Graphics card(s):
      • 2xSapphire HD 4870 512MB CrossFireX
      • PSU:
      • Corsair HX520W
      • Case:
      • Coolermaster Black Widow
      • Operating System:
      • Windows 7 x64
      • Monitor(s):
      • DELL U2311
      • Internet:
      • Virgin 50Mb

    AWK help...

    Hey Guys,

    I've got some coursework which needs to be scripted in AWK. Basically I need to search through HTML files, pull out all the links (href=, img src=, etc.) and count the number of times each one occurs in the page. I spent about 3 hours yesterday learning AWK and came up with a script which *almost* works, except that if a line contains more than one link, only the first is picked up and the rest are ignored.

    I used a for ( i = 1; i <= NF ; i++ ) loop, and it seems that as soon as a matching field is found ( if ( $i ~ /^href/ ) ) that part of the script is executed (which is fine), but then instead of continuing around the loop it jumps on to the next line of the input file.

    I'll post my code if it's any use to anyone, but how can I stop AWK from jumping out of the loop (there are no break / continue statements in it)? I'm confused and the coursework is due in on Thursday.

  2. #2
    cat /dev/null streetster's Avatar
    here's the code if anyone can help:


    Code:
    #!/bin/gawk -f
    # awk script to do cool stuff
    
    	# hyperlinks begin with '<a href=' so look for all lines which contain href
    	/href/	{
    
    		for (i=1; i< NF+1; i++) # for each field on the line
    		{
    			if ( $i ~ "^href" )				# if the field begins with href
    	 		{
    
    			   $f  = substr($i , 7)
    			   len = index($f , "\"")
    			   $f  = substr($f , 0 , len - 1)
    							
    				 if ($f != "") 					# if the entry isnt blank then we need to add it to an array
    					
    					{
     								
     					# if ends in html/htm/php:
     					if ( $f ~ "html$" || $f ~ "htm$" || $f ~ "php$" || $f ~ "shtml$")
     					 {
    					  htm[$f] = htm[$f] + 1	   ;		# add it to the html array
    					 }
    					# otherwise we have another extension eg .exe)
    					else 
    					 {
    					  def[$f] = def[$f] + 1 ;		# add to default array
    					 }
    					}
    							
    			}
    		}
    		}
    
    	# image links begin with '<img src=' so look for all lines which contain src
    	/src/		{
    
    		for (i=1; i< NF+1; i++) # for each field on the line (NF+1 so the last field is included)
    
    			if ( $i ~ "^src" )				# if the field begins with src
    			 {
    				len = length($i) -6;			# work out where the end of the link is
    				
    				$f = substr($i, 6, len )		# strip everything but the link (eg /img/picture.jpg)
    				
    				# search for JPG/JPEG/GIF/PNG (upper/lowercase also)
    				if ($f ~ "[Jj][Pp][Gg]$" || $f ~ "[Jj][Pp][Ee][Gg]$" || $f ~ "[Gg][Ii][Ff]$" || $f ~ "[Pp][Nn][Gg]$" )
    			   		img[$f] = img[$f] + 1;		        # add to the image array
    					
    			}
    					
    	}
    	
    	END{ print "\tHTML Documents: ";
    	
    		 for (entry in htm) print "\t"entry, htm[entry];
    		 
    		 print "\n\tIMAGES: ";
    		 
    		 for (entry in img) print "\t"entry, img[entry];
    		 
    		 print "\n\tOTHER: ";
    		 
    		 for (entry in def) print "\t"entry, def[entry];
    		 		 
    	}
    Last edited by streetster; 10-05-2005 at 10:26 AM.
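
A likely explanation for the loop appearing to stop after the first match, offered as a sketch rather than a definitive diagnosis: `f` is never given a value in the script above, so AWK treats `$f` as `$0`. Assigning to `$0` replaces the whole record and re-splits it into fields, which changes `NF` in the middle of the loop. Keeping the link in a plain variable avoids that. The function name `count_hrefs` and the assumption that every attribute is written as `href="..."` with double quotes are illustrative, not taken from the thread.

```shell
# Hypothetical helper: reads HTML on stdin and prints "target count" pairs.
count_hrefs() {
  awk '
    /href/ {
      for (i = 1; i <= NF; i++) {          # walk every field on the line
        if ($i ~ /^href="/) {
          f = substr($i, 7)                # plain variable; "$f" would mean $0 here
          end = index(f, "\"")             # position of the closing quote
          if (end > 1) {
            f = substr(f, 1, end - 1)      # keep only the link target
            count[f]++
          }
        }
      }
    }
    END { for (link in count) print link, count[link] }
  ' | sort
}
```

It can be fed a file with `count_hrefs < page.html`, where `page.html` stands for any input file. Because `$0` is never modified, `NF` stays fixed and every field on the line is examined.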

  3. #3
    Senior Member Kezzer's Avatar
    Join Date
    Sep 2003
    Posts
    4,863
    Thanks
    12
    Thanked
    5 times in 5 posts
    I'd help if I could, but I've never developed in that language.

  4. #4
    cat /dev/null streetster's Avatar
    Was hoping some Unix god would show up and tell me how to do it, or just say 'AWK is a pattern-matching scripting language and only picks up the first word on the line', which will make me cry.

  5. #5
    Senior Member
    Join Date
    Oct 2003
    Posts
    2,069
    Thanks
    4
    Thanked
    7 times in 3 posts
    Lol Kezzer...
    I've never done it either, so here's another useful response
    Twigman

  6. #6
    cat /dev/null streetster's Avatar
    Cheers Paul... how come no-one knows any AWK? I managed to teach myself some in 4 hours. Surely someone must've done a course in it or something!

  7. #7
    Senior Member Kezzer's Avatar
    I'd never heard of it until you mentioned it. I could probably figure it out, but I haven't got time.

  8. #8
    Senior Member Nemeliza's Avatar
    Join Date
    Jul 2003
    Posts
    1,719
    Thanks
    1
    Thanked
    5 times in 5 posts
    Quote Originally Posted by KeZZeR
    I'd never heard of it until when you mentioned it.

  9. #9
    Senior Member Kezzer's Avatar
    Damn you

  10. #10
    Comfortably Numb directhex's Avatar
    Join Date
    Jul 2003
    Location
    /dev/urandom
    Posts
    17,074
    Thanks
    228
    Thanked
    1,027 times in 678 posts
    • directhex's system
      • Motherboard:
      • Asus ROG Strix B550-I Gaming
      • CPU:
      • Ryzen 5900x
      • Memory:
      • 64GB G.Skill Trident Z RGB
      • Storage:
      • 2TB Seagate Firecuda 520
      • Graphics card(s):
      • EVGA GeForce RTX 3080 XC3 Ultra
      • PSU:
      • EVGA SuperNOVA 850W G3
      • Case:
      • NZXT H210i
      • Operating System:
      • Ubuntu 20.04, Windows 10
      • Monitor(s):
      • LG 34GN850
      • Internet:
      • FIOS
    if my memory serves me correctly, by default, AWK presumes that a new line means a new record, and a space means a new field.

    you can override these (the FS and RS variables) in a BEGIN{} block

    so add to the start of your code:

    Code:
    BEGIN {
      RS="<"
    }
    that'll change AWK's settings, so rather than working on a per-line basis, it'll work on a per-HTML-tag basis, so:

    Code:
    <html><badgers></badgers></html>
    would be four records, not one. unfortunately, i can't remember whether the records do or don't include the < character, but i'm not sure it matters in the context
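
To make the `RS="<"` suggestion concrete, here is a small sketch; the function name `split_tags` is hypothetical. The separator itself is not part of the record, so each record begins right after a `<`. Input that starts with `<` also yields an empty leading record, which the `NF > 0` guard skips.

```shell
# Hypothetical helper: prints each non-empty record when "<" is the record separator.
split_tags() {
  awk '
    BEGIN { RS = "<" }                        # one record per tag instead of per line
    NF > 0 { n++; printf "%d: [%s]\n", n, $0 } # skip the empty record before the first "<"
  '
}
```

Running `printf '%s' '<html><badgers></badgers></html>' | split_tags` prints four records, `html>`, `badgers>`, `/badgers>` and `/html>`, none of which contains the `<`. With one tag per record, a line holding several links no longer matters to a per-record `href` check.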

  11. #11
    cat /dev/null streetster's Avatar
    sorted cheers

  12. #12
    Senior Member
    Another useful post:
    Was it you I saw on the grass outside the union this afternoon ~3:15pm?
    I was going to go and see, but I cba.
    Twigman

  13. #13
    cat /dev/null streetster's Avatar
    yeh it was, i was attempting to sunbathe as it was such a lovely day
