stuff for scraping executive orders

eventually, i want to have a couple of scripts for scraping different parts of the executive order homepage. for now, just these two.

this one gets titles and hrefs (to be appended to a base url in a separate program)

python eoTitleScrape.py 0 > out.txt
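eoTitleScrape.py itself isn’t pasted here, but a minimal sketch of the idea (python 3; the url pattern and the h2 selector are guesses, and the 0 argument is presumably a page number):

import sys
import urllib.request
from bs4 import BeautifulSoup

# guessed url pattern: the EO listing on whitehouse.gov, paginated
page = sys.argv[1]
url = 'https://www.whitehouse.gov/presidential-actions/page/' + page + '/'

html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# guess: each order is a link inside an <h2>; print the title and the href
for h2 in soup.find_all('h2'):
    a = h2.find('a')
    if a is not None:
        print(a.get_text(strip=True))
        print(a.get('href'))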

output from running the script twice:

Presidential Executive Order on a Comprehensive Plan for Reorganizing the Executive Branch
Executive Order Protecting The Nation From Foreign Terrorist Entry Into The United States
Presidential Executive Order on The White House Initiative to Promote Excellence and Innovation at Historically Black Colleges and Universities
Presidential Executive Order on Restoring the Rule of Law, Federalism, and Economic Growth by Reviewing the “Waters of the United States” Rule
Presidential Executive Order on Enforcing the Regulatory Reform Agenda
Providing an Order of Succession Within the Department of Justice
Presidential Executive Order on Enforcing Federal Law with Respect to Transnational Criminal Organizations and Preventing International Trafficking
Presidential Executive Order on Preventing Violence Against Federal, State, Tribal, and Local Law Enforcement Officers
Presidential Executive Order on a Task Force on Crime Reduction and Public Safety
Presidential Executive Order on Core Principles for Regulating the United States Financial System

Presidential Executive Order on Reducing Regulation and Controlling Regulatory Costs
Executive Order: ETHICS COMMITMENTS BY EXECUTIVE BRANCH APPOINTEES
EXECUTIVE ORDER: PROTECTING THE NATION FROM FOREIGN TERRORIST ENTRY INTO THE UNITED STATES
Executive Order: Border Security and Immigration Enforcement Improvements
Executive Order: Enhancing Public Safety in the Interior of the United States
Executive Order Expediting Environmental Reviews and Approvals For High Priority Infrastructure Projects
Executive Order Minimizing the Economic Burden of the Patient Protection and Affordable Care Act Pending Repeal

this one gets lines of executive orders:
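that script isn’t shown here either, but the gist is probably: take one of the hrefs from the first script, join it to the base url, and print each paragraph of the order on its own line. something like (the base url and selector are guesses):

import sys
import urllib.request
from bs4 import BeautifulSoup

# hypothetical: base url plus one of the hrefs from the first script
url = 'https://www.whitehouse.gov' + sys.argv[1]

html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# print each paragraph of the executive order as its own line
for p in soup.find_all('p'):
    print(p.get_text(strip=True))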

now i wanna do some pattern analysis.

the adl is an israel advocacy org

the adl has become more palatable to leftier people since their bonkers ED, abe foxman, left in 2015. but the way they dress up israel advocacy work as general “anti-hate” work is really problematic, especially when they use the anti-hate banner to lump palestine solidarity activists in with, like, richard spencer.

where is some of this lumping happening? in the blog tags!

i wrote this thing to get blog posts off the old adl website. when i tried with just urllib, i got a weird javascript error, which, upon googling, i learned had to do with a script some websites have that says “don’t load anything if the requests aren’t coming from an actual browser.” the workaround appears to be using a headless browser. i think what’s happening below is that phantomjs is the headless browser itself, and selenium is the python wrapper that drives it from my script, which is why i need both.

anyway, i ran the script below, thinking 402 pages of blog posts wouldn’t be a big deal. 402 pages of blog posts is, in fact, a big deal. so i changed the params to only grab pages 360 to 402.
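roughly what the script does (a sketch, not the real thing: the url pattern for the old adl blog is a guess, and phantomjs has to be installed separately so selenium can find it):

from selenium import webdriver

# selenium is just the python wrapper; phantomjs is the actual headless browser
driver = webdriver.PhantomJS()

# only grab pages 360 to 402 instead of all 402
for page in range(360, 403):
    driver.get('http://blog.adl.org/page/' + str(page))  # hypothetical url pattern
    print(driver.find_element_by_tag_name('body').text)

driver.quit()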

cool! i grepped ‘students’ but then thought ‘tags’ were more interesting.

grep 'tags' -i adl-posts-pages-360-402.txt > tags2.txt

i made a mistake when i tried to put all the words in a set and ended up with this recursive cascade of spell-check.

but this worked:

cat tags.txt | python toSet.py
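toSet.py isn’t reproduced here; assuming each grep’d line looks something like “Tags: anti-Israel, bds, international”, a minimal version that tallies the tags could be:

import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    # assumed line format: "Tags: anti-Israel, bds, international"
    if ':' in line:
        for tag in line.split(':', 1)[1].split(','):
            tag = tag.strip()
            if tag:
                counts[tag] += 1

for tag, n in counts.items():
    print(tag + ': ' + str(n))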

output

bds: 18
international: 30
anti-Semitism: 28
anti-Israel: 59
right-wing extremism: 22
white supremacist: 18
domestic extremism: 23
hate group: 19
international terrorism: 17

ugh. fuck that.

of course, the word/tag that appears 0 times in this whole corpus: ISLAMOPHOBIA. so an acrostic-ish. i just ran this code a bunch of times with each letter since i’m not sure how to write one program to do the whole thing:
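(the per-letter snippet i actually ran isn’t pasted here, but for the record, one program that does the whole acrostic in a single pass might look like this, assuming the corpus file has one sentence per line:)

import random

word = 'ISLAMOPHOBIA'

# read the corpus, one sentence per line
lines = [line.strip() for line in open('adl-posts-pages-360-402.txt') if line.strip()]

# for each letter, pick a random line that starts with that letter
for letter in word:
    candidates = [l for l in lines if l.upper().startswith(letter)]
    if candidates:
        print(random.choice(candidates))
    print('')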

example output per letter before i added some additional filters:

the “poem”:

But I know who writes your pay checks.”

which maligns and debases the Jewish State.

ADL has repeatedly urged the UN to defund and disband it.

A rundown of the call:

Efforts by Methodist anti-Israel activists to divest from Israel began in 2008

One of the activists argued that these demolitions are made possible by U.S

The boycott effort was spearheaded by a Dubai-based Palestinian novelist named Huzama Habayeb

Jewish Voice for Peace Promotes Anti-Israel Hagaddah

One of the men,

The event hosted a former Israel Defense Forces (IDF) soldier, Sergeant Benjamin Anthony.

The Muslim Student Union at UC Irvine is at it again.

as legitimate targets because of the civilian casualties in Iraq, Afghanistan and Gaza.


weird thing:

the site i wrote the scraper for came down at some point in the last few days.

mysteries. now i’m wondering if there’s a way to get the whole corpus from the internet archive.
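one possible route (untested): the internet archive’s CDX api lists snapshots for a url prefix, and each snapshot can then be fetched from web.archive.org. the domain here is a guess at where the old blog lived:

import json
import urllib.request

# ask the wayback machine's CDX api for snapshots of the old blog
cdx = ('http://web.archive.org/cdx/search/cdx'
       '?url=blog.adl.org/&matchType=prefix&output=json&limit=20')
rows = json.loads(urllib.request.urlopen(cdx).read())

# the first row is a header; each other row includes a timestamp and the original url
for row in rows[1:]:
    timestamp, original = row[1], row[2]
    print('http://web.archive.org/web/' + timestamp + '/' + original)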

python: counting words in the first executive order

cat EO1-clean.txt | python wordcount.py

then, add this to get it in a pretty format:
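neither wordcount.py nor the pretty-format addition is pasted in this post, so here’s a hedged guess at the combination:

import sys
from collections import Counter

# count every whitespace-separated word coming in on stdin
counts = Counter(sys.stdin.read().split())

# the "pretty format" part: most common first, counts right-aligned
for word, n in counts.most_common():
    print(str(n).rjust(5) + ' ' + word)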


🐍🐍 python, week 5 🐍🐍

i used code from here to put all the words from the first executive order into a set, which is a way to get all the unique words in the document.
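that code isn’t reproduced here, but the heart of it is presumably something like:

import sys

# every unique whitespace-separated "word", one per line
for word in set(sys.stdin.read().split()):
    print(word)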

which produced “words” separated by line breaks. here are some interesting sections:

all
United
burden

PENDING
out
purchasers
Patient
for
enforceable
availability
HOUSE,

health
imperative
Nothing
benefit

Human
repeal,
or
otherwise
individuals,
control
Constitution

unwarranted
fiscal
head

with
legislative
Procedure
CARE
me
commerce
agency

Act
authorities
such
WHITE
law

affect:
impair
does
In
the
insurers,

okay so obviously some of these would be neat poems, so i tried to join them:

hmmmm noooo…

hmmmm nooooooo… okay, new activity: replacing the executive order with these poems.

i’m doing this manually for now since it would involve a bunch of regex, but i’ll record the steps here:

  1. replace all instances of “Minimizing the Economic Burden of the Patient Protection and Affordable Care Act Pending Repeal” with the first poem above, “all United burden,” in the style in which the original text appears (so, with .title() or .upper())
  2. when sections begin, keep the text naming the section as such (“Section 1”, “Sec. 2”, etc.) but replace the body of the section with the next poem above. remove newlines from poems above so the words flow like sentences, but don’t change case, punctuation, etc.
  3. fill sections for as many poems as were originally picked out from the set. delete sections that don’t have an accompanying poem.

this feels very related to a project i did in jer’s class last year where i replaced “mortgage” language with “data” language in hank paulson’s 2008 announcement about the economy. python woulda helped with that/made it better. anyway, executive order results here, original here.

another thing i was working on was figuring out how to clean up the file without going through it manually. these are things i did in the interpreter. i wonder if there’s a way to say: if the space char appears 2 or more times in a row, replace the run with a single ' '? it’d also be cool to figure out how to split on html tags so i don’t have to manually delete those. maybe this will be useful later.
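both of those turn out to be doable with the re module; a quick sketch (the sample line is made up):

import re

line = 'Sec.  2.   Policy.  <p>It is the policy of the United States...</p>'

# collapse any run of 2 or more spaces down to a single space
collapsed = re.sub(r' {2,}', ' ', line)

# split on html tags (anything between < and >) and drop the tags themselves
pieces = [p.strip() for p in re.split(r'<[^>]+>', collapsed) if p.strip()]
print(pieces)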


horrifyingpoem.com

we’re talking about ~list comprehensions~ in class so i’m testing this demo code on the executive order i was playing with earlier.

$ only-5-letter-words.py <EO1-clean.txt >test.txt
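only-5-letter-words.py isn’t pasted here, but with a list comprehension it’s probably close to:

import sys

# naive about punctuation: "order" counts, "order," doesn't
for line in sys.stdin:
    five = [w for w in line.split() if len(w) == 5]
    print(' '.join(five))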

the cut up method, uncreative writing, delusions of whiteness, et al.

the cut up method, william burroughs

into this: “You can not will spontaneity. But you can introduce the unpredictable spontaneous factor with a pair of scissors.”

and this, a great q: “HOW MANY DISCOVERIES SOUND TO KINESTHETIC?”

i’ve been wanting to try this method with windows open on my computer, like this:

uncreative writing, kenneth goldsmith

“the act of pushing language around”

“gift economies, open-source cultures…” 🙄

“Lethem’s piece is a self-reflexive, demonstrative work of unoriginal genius.” o rly?

“For them, the act of writing is literally moving language from one place to another, proclaiming that context is the new content.” ah, like financialization! the value is in the asset bundling. or something.

delusions of whiteness in the avant garde, cathy park hong

“the luxurious opinion that anyone can be ‘post-identity’”

“expired snake oil”, “masturbatory exegesis”

what does she mean by this? “in complete transcription, in total paratactic scrambling,”

“Here is how Dworkin and Goldsmith characterize Zong: “the ethical inadequacies of that legal document . . . do not prevent their détournement in the service of experimental writing.” God forbid that maudlin and heavy-handed subjects like slavery and mass slaughter overwhelm the form!”

omg omg i promise i will not c&p the entire essay, but… “To be an identity politics poet is to be anti-intellectual, without literary merit, no complexity, sentimental, manufactured, feminine, niche-focused, woefully out-of-date and therefore woefully unhip, politically light, and deadliest of all, used as bait by market forces’ calculated branding of boutique liberalism. Compare that to Marxist—and often male—poets whose difficult and rigorous poetry may formally critique neoliberalism but is never “just about class” in the way that identity politics poetry is always “just about race,” with little to no aesthetic value.”

“…say a few more panels on forgotten subaltern poetry for the next wax museum conference?” 😱

code

all of trump’s executive orders are hosted on the whitehouse.gov site like blog posts. i pulled them all down using curl and put them into individual text files (which, btw, i wonder if there is a feed of these somewhere?):
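the curl commands looked roughly like this (the slug is hypothetical):

$ curl https://www.whitehouse.gov/presidential-actions/some-eo-slug/ > EO1.txt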

i already loved HTML, but now i love it even more because each <tag> is followed by a newline. oh right obvi this is because browsers are trying to parse things just like i am. i am a browser. not a very good browser yet :-/

i tried wc -l and wc -m to count lines and chars. i’m sure one day these commands will be useful for something?

i cleaned up one of the executive orders by deleting all the stuff at the top and bottom, leaving just the <div>s and <p>s that contained EO content. i should figure out how to do this programmatically. maybe

  1. look through file with my actual eyeballs
  2. grep parent div that i want and remove anything before it.
  3. not sure how i would find the closing </div> for that section to delete everything after… (beautifulsoup could probably handle this; see the sketch after this list)
  4. line.strip() to get everything on its own line
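a hedged beautifulsoup sketch of steps 2 and 3, which sidesteps the closing-</div> problem by grabbing the parent div as a parsed node instead of grepping for it (the class name is hypothetical; it would have to match whatever div actually wraps the EO body):

from bs4 import BeautifulSoup

html = open('EO1.txt').read()
soup = BeautifulSoup(html, 'html.parser')

# hypothetical selector for the div that wraps the EO content
body = soup.find('div', class_='field-item')
if body is not None:
    for p in body.find_all('p'):
        print(p.get_text().strip())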

anyway, i did it manually for now. i used the randomizer.py code from class to shuffle the lines of the first executive order.

$ cat EO1-clean.txt | python randomizer.py > output.txt
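(for context, randomizer.py is probably more or less this: read the lines from stdin, shuffle them, print them back out.)

import sys
import random

# shuffle the incoming lines and print them back out
lines = sys.stdin.readlines()
random.shuffle(lines)
for line in lines:
    print(line.strip())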

and then, since it was HTML, i put it in an index.html file.

i tried to do a slightly different version that splits each line at “law” and removes that word, then joins everything back together, and shuffles.

but it just resulted in a thousand AttributeErrors
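my guess is the AttributeError came from calling a string method on the list that split() returns; a version of the idea that should run:

import sys
import random

pieces = []
for line in sys.stdin:
    # splitting at "law" removes the word itself and leaves the fragments
    pieces.extend(part.strip() for part in line.split('law') if part.strip())

# shuffle the fragments and join everything back together
random.shuffle(pieces)
print(' '.join(pieces))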

but what i really want to do is a cutup of both the content and the HTML tags, with the css still applying to the tags. that way, i’d get divs and buttons and menus all over the place.

a bunch of experiments with amazon annual reports

unix text processing commands in python

tr 'value' for 'surveillance'

python string methods docs for future reference

experiment 1: grep, then print line up to 100 words

$ grep 'We' amz_1997_shareholder_letter.txt | python strip_line.py >amz_we_1997.txt

$ grep 'We' amz_2015_shareholder_letter.txt | python strip_line.py >amz_we_2015.txt
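strip_line.py isn’t included here; assuming “up to 100 words” means each line gets truncated to its first 100 words, it might be:

import sys

# print each incoming line truncated to its first 100 words
for line in sys.stdin:
    print(' '.join(line.split()[:100]))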

experiment 2: grep, then print line between 50 and 100 chars

$ grep 'We' amz_1997_shareholder_letter.txt | python strip_line_50_100.py >amz_50-100_1997.txt

$ grep 'We' amz_2015_shareholder_letter.txt | python strip_line_50_100.py >amz_50-100_2015.txt
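and strip_line_50_100.py presumably keeps only the lines whose length falls between 50 and 100 characters:

import sys

# keep only lines that are between 50 and 100 characters long
for line in sys.stdin:
    line = line.strip()
    if 50 <= len(line) <= 100:
        print(line)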


$ grep 'user base' fb_2015_annual_report.txt | python strip_line_50_100.py >fb_user_base_chunk_2015.txt

i think some questions these experiments raise for me are:

  • okay, i have some blocks of text. i’m not used to thinking about chunks of text in a structural way. what do i do with the chunks?
  • how do i break up the chunks in a programmatic way?
  • do the chunks have anything to do with the content?
  • highlighting the corporate jargon-y-ness of annual reports is not very interesting. what’s a more interesting thing to do with corporate jargon?
  • workflow! should i create a file for each new experiment? only good experiments? what’s the best way to name these slight variations? how do i document the command? i love the idea of these commands with slight variations as a score <3

we believe these lawsuits are without merit

i was trying to do a different thing when i entered
$ cat 2015-Annual-Report.txt | grep 'connect' | tr '.' '\n' | sort
but this was part of the output and i like it:

i looked for a related thing in a different text:
$ grep 'We' blackreconstruction.rtf | sort
there are some We’s in quotes:

and some not:


$ ~/command/line/adventure

our first homework assignment for reading and writing electronic text with allison parrish is, shockingly, to read and write electronic text.

i completed this great series of command line exercises; read padgett; loved some of these sentences from loss pequeño glazier’s “grep: a grammar” either because i’m a dork or a sucker for obscurantist writing about writing or both:

“writing as the action of production (process). That is, to a viewpoint where it’s the procedure or algorithm that counts, the output being simply a by-product of that activity.”

“Such materiality is evident in concrete conceptions of language: “literal strings,” “strings,” “regular expressions,” and “compound expressions” are among the way language is viewed in the world of grep.”

“Like the hole in Pollock’s paint can, a grep is an opening into the world of the materiality of words constituting the electronic text file.”

but my favorite part was the command line adventure of installing pdftotext. when you download the precompiled binary from the website, you get this:

i’m used to a GUI interface for installing stuff, so this was new. i opened the INSTALL instructions, which say:

for step 1, i couldn’t figure out how to copy an entire directory so i just cp’d one executable at a time.
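(for next time: a glob would have copied them all at once; the bin64 folder name is a guess at how the package is laid out)

$ sudo cp bin64/* /usr/local/bin/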

with step 2, i ran into a problem.

the terminal kept telling me i was using the cp command wrong. some googling revealed that this happened because /usr/local/man/man1 doesn’t exist on my mac; the man pages actually live in /usr/share/man/man1.

i sudo installed in the correct directory and checked that the page existed with man pdftotext:
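so the working version of step 2 was something like this (the exact source path depends on where the .1 files sit in the download):

$ sudo cp pdftotext.1 /usr/share/man/man1/
$ man pdftotext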

voila! then, running pdftotext 2015-Annual-Report.pdf exported the 2015 facebook annual report to a text file with the same name (but a .txt extension) in the same directory where i ran the command.