Screen scraping with TextMate
One of the capabilities which gets little mention is TextMate’s recordable macros. I even get feature requests for things which could easily be done with a macro (in seconds), and macros do not require any programming, so they are probably a little overlooked. What follows is one example where I wouldn’t have wanted to be without them.
Yesterday PayPal failed to send notifications when people made a purchase (using IPN). This meant I had to enter the data manually into my database. The data I was interested in was: transaction ID, name, email, price, currency, fee, date + time, and the custom field (so that’d be 8-9 copy operations per order confirmation page).
My current sales volume doesn’t justify writing an actual script to fetch the data (which is behind a cumbersome login procedure), but it is big enough that I don’t want to do this manually for each order (though my threshold for not wanting to do something manually is often just 2 :) ). So I opened all the order confirmations in background tabs (so that I could retire the mouse after this), and then for each of these did ⌘A, ⌘C (copy entire page to clipboard), ⌘⇥, ⌘V (switch to TextMate and paste the page).
I ended up with all the pages in TextMate, but in the text version that my browser was able to create from them. The next step was to reformat the data, and that’s where macros are useful, because all I had to do was place the caret at the beginning of the first pasted page, press ⌥⌘M (start recording a macro), and then reformat this first page as I wanted it (i.e. for inputting into my database). After having reformatted it I pressed ⌥⌘M again (to stop macro recording) and simply pressed ⌃⌘M repeatedly, to reformat each of the following pages one by one.
After only a few minutes I had all my orders nicely formatted for input into my database.
I probably should have saved the macro for the next time I need to do this, but it was so easy to create that I didn’t ;)
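For comparison, the per-page transformation that the recorded macro performs could be sketched as a small Python function. This is only an illustration under invented assumptions: the post does not show what the pasted PayPal pages look like, so the "Label: value" layout, the field labels in `FIELDS`, and the tab-separated output are all hypothetical.

```python
import re

# Hypothetical field labels; the real page text is not shown in the post.
FIELDS = ["Transaction ID", "Name", "Email", "Amount", "Fee", "Date"]

def reformat_page(page_text):
    """Turn one pasted page into a tab-separated row, one column per field."""
    values = []
    for label in FIELDS:
        # Look for a line of the (assumed) form "Label: value".
        pattern = r"^[ \t]*" + re.escape(label) + r":[ \t]*(.+?)[ \t]*$"
        match = re.search(pattern, page_text, re.MULTILINE)
        # A missing field becomes an empty column instead of an error.
        values.append(match.group(1) if match else "")
    return "\t".join(values)
```

Calling `reformat_page` once per pasted page plays the same role as pressing the replay shortcut once per page: the steps are defined once and then repeated over every block.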
Some hints when you create macros for text reformatting:
- Let the macro end with the caret in front of the next piece of data that (potentially) needs to be reformatted. That way you won’t need to move the caret between executions. If there is “noise” between the data blocks, let the last macro action be a search for the start of the next data block. That way you will even get a nice beep after having reformatted the last block (since it won’t find a new block).
- Use the find window whenever you need to move the caret an unknown number of characters. If you use the find window, the macro will record the find string you use, as opposed to “Find Next” (⌘G), which uses the current contents of the find clipboard.
- You can place the selection on the find clipboard with “Use Selection for Find” (⌘E).
- You can use regular expressions for conditional replaces. For example, if a web-link may be an email, and you want to prepend mailto: if it is, let the macro select the link and do a regular expression replace (all in selection) which searches for:
    ^\w+@\w+\.\w+$
  (should match an email) and replaces it with:
    mailto:$0
  If it doesn’t match, nothing will be replaced. Additionally, you can test in the format string whether a capture was matched, so if we instead search for:
    ^(\w+@\w+\.\w+)?|.*$
  (email placed in capture register 1, or match whatever, if no email exists) and replace it with:
    (?1:mailto\:$0:$0.html)
  it will prepend mailto: to emails, and append .html to everything else. See Regular Expressions in TextMate’s Help Book for more about the syntax of these things.
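The conditional replace from the last hint can also be tried outside TextMate. The following is a minimal Python sketch of the same idea, not TextMate's format string syntax: the function name `fix_link` is made up, the "was capture 1 matched?" test is written as a replacement function, and the pattern is regrouped with `(?:...)` as an adjustment for Python's `re` module rather than copied verbatim from the hint.

```python
import re

# Capture 1 is set only when the whole string looks like an email;
# otherwise the second alternative matches whatever is there.
LINK = re.compile(r"^(?:(\w+@\w+\.\w+)|.*)$")

def fix_link(link):
    """Prepend 'mailto:' to emails and append '.html' to everything else."""
    def replace(match):
        if match.group(1) is not None:
            # Capture 1 matched: treat the link as an email address.
            return "mailto:" + match.group(0)
        # Capture 1 did not match: treat the link as a page name.
        return match.group(0) + ".html"
    return LINK.sub(replace, link, count=1)
```

For instance, `fix_link("joe@example.com")` gives `mailto:joe@example.com`, while `fix_link("about")` gives `about.html`.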