stuff for scraping executive orders

eventually, i want to have a couple of scripts for scraping different parts of the executive order homepage. for now, just two.

this one gets titles and hrefs (to be appended to a base url in a separate program)

python eoTitleScrape.py 0 > out.txt
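a minimal sketch of what a title-and-href scraper could look like, using only the standard library. the `LinkCollector` class and the sample markup are made up for illustration; the real page structure, and whatever the `0` argument in the command above does, aren't covered here.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (title, href) pairs from every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (title, href) pairs
        self._href = None    # href of the <a> we are currently inside
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # collapse line breaks and repeated spaces inside the link text
            title = " ".join("".join(self._text).split())
            if title:
                self.links.append((title, self._href))
            self._href = None


def titles_and_hrefs(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# hypothetical markup, not copied from the real site
sample = '<h3><a href="/eo/example-order">Executive Order:\n Example Title</a></h3>'
for title, href in titles_and_hrefs(sample):
    print(title, href)
```

the hrefs come back relative, which matches the plan of appending them to a base url in a separate program.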

output from running the script twice:

Presidential Executive Order on a Comprehensive Plan for Reorganizing the Executive Branch
Executive Order Protecting The Nation From Foreign Terrorist Entry Into The United States
Presidential Executive Order on The White House Initiative to Promote Excellence and Innovation at Historically Black Colleges and Universities
Presidential Executive Order on Restoring the Rule of Law, Federalism, and Economic Growth by Reviewing the “Waters of the United States” Rule
Presidential Executive Order on Enforcing the Regulatory Reform Agenda
Providing an Order of Succession Within the Department of Justice
Presidential Executive Order on Enforcing Federal Law with Respect to Transnational Criminal Organizations and Preventing International Trafficking
Presidential Executive Order on Preventing Violence Against Federal, State, Tribal, and Local Law Enforcement Officers
Presidential Executive Order on a Task Force on Crime Reduction and Public Safety
Presidential Executive Order on Core Principles for Regulating the United States Financial System

Presidential Executive Order on Reducing Regulation and Controlling Regulatory Costs
Executive Order: ETHICS COMMITMENTS BY EXECUTIVE BRANCH APPOINTEES
EXECUTIVE ORDER: PROTECTING THE NATION FROM FOREIGN TERRORIST ENTRY INTO THE UNITED STATES
Executive Order: Border Security and Immigration Enforcement Improvements
Executive Order: Enhancing Public Safety in the Interior of the United States
Executive Order Expediting Environmental Reviews and Approvals For High Priority Infrastructure Projects
Executive Order Minimizing the Economic Burden of the Patient Protection and Affordable Care Act Pending Repeal

this one gets lines of executive orders:

now i wanna do some pattern analysis.

the adl is an israel advocacy org

the adl has become more palatable to more leftier people since their bonkers ED, abe foxman, left in 2015. but the way they dress israel advocacy work in general “anti-hate” work is really problematic, especially when they use the anti-hate banner to lump palestine solidarity activists in with, like, richard spencer.

where is some of this lumping happening? in the blog tags!

i wrote this thing to get blog posts off the old adl website. when i tried with just urllib, i got a weird javascript error, which, upon googling, i learned had to do with a script some websites have that says “don’t load anything if the requests aren’t coming from an actual browser.” the workaround appears to be using a headless browser. i think what’s happening below is that phantomjs is running the headless browser, but since i’m in python i have to use a selenium wrapper? i’m not positive why i need both.

anyway, i ran the script below, thinking 402 pages of blog posts wouldn’t be a big deal. 402 pages of blog posts is, in fact, a big deal. so i changed the params to only grab pages 360 to 402.
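on the "why both" question: phantomjs is the headless browser itself, and selenium is the python library that sends it commands. here is a rough sketch of that split. the `?page=` url pattern, the example.org address, and both function names are assumptions for illustration, not the real site's layout. the real browser part is left in comments because selenium is a third-party install and newer selenium releases no longer ship a phantomjs driver (headless chrome or firefox is the usual substitute).

```python
def page_urls(base_url, first, last):
    """Build one URL per blog page, first..last inclusive."""
    return [f"{base_url}?page={n}" for n in range(first, last + 1)]


def fetch_pages(driver, urls):
    """Load each URL in the browser and return the rendered HTML.

    `driver` is anything with .get(url) and .page_source, which is the
    interface a selenium WebDriver exposes.
    """
    pages = []
    for url in urls:
        driver.get(url)                   # the browser runs the page's javascript
        pages.append(driver.page_source)  # html after the scripts have run
    return pages


# with a real browser it might look like this (not run here):
#   from selenium import webdriver
#   driver = webdriver.PhantomJS()   # older selenium releases only
#   html = fetch_pages(driver, page_urls("https://example.org/blog", 360, 402))
#   driver.quit()
```

keeping the page range as two arguments makes it easy to go from all 402 pages down to 360 through 402 without touching the loop.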

cool! i grepped ‘students’ but then thought ‘tags’ were more interesting.

grep 'tags' -i adl-posts-pages-360-402.txt > tags2.txt

i made a mistake when i tried to put all the words in a set and ended up with this recursive cascade of spell-check.

but this worked:

cat tags.txt | python toSet.py
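a sketch of how a tag counter like this could work. it assumes each grepped line looks like `Tags: a, b, c`, which is a guess at the format; `count_tags` is a hypothetical name, not the contents of toSet.py.

```python
from collections import Counter


def count_tags(lines):
    """Count comma-separated tags on lines that look like 'Tags: a, b, c'."""
    counts = Counter()
    for line in lines:
        label, sep, rest = line.partition(":")
        if not sep or label.strip().lower() != "tags":
            continue  # skip lines that aren't in the assumed format
        for tag in rest.split(","):
            tag = tag.strip()
            if tag:
                counts[tag] += 1
    return counts


# made-up sample lines
sample = ["Tags: bds, anti-Israel", "tags: anti-Israel", "not a tag line"]
for tag, n in count_tags(sample).most_common():
    print(f"{tag}: {n}")
```

to use it as a filter, pass `sys.stdin` as `lines`.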

output

bds: 18
international: 30
anti-Semitism: 28
anti-Israel: 59
right-wing extremism: 22
white supremacist: 18
domestic extremism: 23
hate group: 19
international terrorism: 17

ugh. fuck that.

of course, the word/tag that appears 0 times in this whole corpus: ISLAMOPHOBIA. so an acrostic-ish. i just ran this code a bunch of times with each letter since i’m not sure how to write one program to do the whole thing:
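one way to do every letter in a single program, sketched under an assumption: that "acrostic-ish" means each line only has to contain its letter somewhere, which is a guess based on the poem further down. `acrostic_ish` is a hypothetical helper and the extra filters aren't reproduced.

```python
import random


def acrostic_ish(word, lines, seed=None):
    """For each letter of `word`, pick one unused line that contains it."""
    rng = random.Random(seed)          # pass a seed for repeatable picks
    unused = [line.strip() for line in lines if line.strip()]
    poem = []
    for letter in word.lower():
        candidates = [line for line in unused if letter in line.lower()]
        if not candidates:
            poem.append("")            # leave a gap when nothing matches
            continue
        choice = rng.choice(candidates)
        unused.remove(choice)          # don't use the same line twice
        poem.append(choice)
    return poem
```

calling `acrostic_ish("ISLAMOPHOBIA", corpus_lines)` would give back twelve lines, one per letter, where `corpus_lines` stands in for the scraped text.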

example output per letter before i added some additional filters:

the “poem”:

But I know who writes your pay checks.”

which maligns and debases the Jewish State.

ADL has repeatedly urged the UN to defund and disband it.

A rundown of the call:

Efforts by Methodist anti-Israel activists to divest from Israel began in 2008

One of the activists argued that these demolitions are made possible by U.S

The boycott effort was spearheaded by a Dubai-based Palestinian novelist named Huzama Habayeb

Jewish Voice for Peace Promotes Anti-Israel Hagaddah

One of the men,

The event hosted a former Israel Defense Forces (IDF) soldier, Sergeant Benjamin Anthony.

The Muslim Student Union at UC Irvine is at it again.

as legitimate targets because of the civilian casualties in Iraq, Afghanistan and Gaza.

 

weird thing:

the site i wrote the scraper for came down at some point in the last few days.

mysteries. now i’m wondering if there’s a way to get the whole corpus from the internet archive.

ui issue for herbivore

surya’s been helping me wrap my brain around vue/vuex so i can do more coding work on herbivore. i filed what i thought would be an easy issue in order to get started:

in my head, i was like, “oh, change the css :hover styling and call it a day.” did not take me long to discover that: no. u see, we’re changing the ui dynamically based on information we’re getting from the network. vue and vuex let you manage all this information by setting the starting state and then defining stuff you want to happen based on new information you get (with getters), actions you do, and mutations that occur to the state based on those actions. what follows is a walk-through of my process, written in present tense even though it’s already done because tense is confusing sometimes. here we go.


python: counting words in the first executive order

cat EO1-clean.txt | python wordcount.py

then, add this to get it in a pretty format:
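a sketch of counting plus a "pretty" aligned printout. this is a guess at the idea, not the contents of wordcount.py; `word_counts` and `pretty` are hypothetical names.

```python
from collections import Counter


def word_counts(lines):
    """Count whitespace-separated words across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def pretty(counts, top=10):
    """Format the most common words in two aligned columns."""
    rows = counts.most_common(top)
    width = max((len(word) for word, _ in rows), default=0)
    return "\n".join(f"{word.ljust(width)}  {n}" for word, n in rows)


print(pretty(word_counts(["the law the", "law of the"])))
```

words are split on whitespace only, so "repeal," and "repeal" would count as different words.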

 

🐍🐍 python, week 5 🐍🐍

i used code from here to put all the words from the first executive order into a set, which is a way to get all the unique words in the document.
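a minimal sketch of the set idea (`unique_words` is a hypothetical name, not the linked code). sets have no defined order, so the words print in an arbitrary sequence that can change from run to run.

```python
def unique_words(lines):
    """Return the set of whitespace-separated words across all lines."""
    words = set()
    for line in lines:
        words.update(line.split())
    return words


# made-up sample lines
sample = ["all United burden", "burden PENDING out"]
for word in unique_words(sample):
    print(word)   # order is arbitrary
```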

which produced “words” separated by line breaks. here are some interesting sections:

all
United
burden

PENDING
out
purchasers
Patient
for
enforceable
availability
HOUSE,

health
imperative
Nothing
benefit

Human
repeal,
or
otherwise
individuals,
control
Constitution

unwarranted
fiscal
head

with
legislative
Procedure
CARE
me
commerce
agency

Act
authorities
such
WHITE
law

affect:
impair
does
In
the
insurers,

okay so obviously some of these would be neat poems, so i tried to join them:

hmmmm noooo…

hmmmm nooooooo… okay, new activity: replacing the executive order with these poems.

i’m doing this manually for now since it would involve a bunch of regex, but i’ll record the steps here:

  1. replace all instances of “Minimizing the Economic Burden of the Patient Protection and Affordable Care Act Pending Repeal” with first poem above, “all United burden,” in the style in which the original text appears (so, with .title() or .upper())
  2. when sections begin, keep the text naming the section as such (“Section 1”, “Sec. 2”, etc.) but replace body of the section with the next poem above. remove newlines from poems above so the words flow like sentences, but don’t change case, punctuation, etc.
  3. fill sections for as many poems as were originally picked out from the set. delete sections that don’t have an accompanying poem.
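the three steps above could be sketched with a bit of regex like this. it assumes section headings start a line as "Section 1" or "Sec. 2", and that the first poem in the list is the one that replaces the title; `poemify` and `match_case` are hypothetical names.

```python
import re

TITLE = ("Minimizing the Economic Burden of the Patient Protection "
         "and Affordable Care Act Pending Repeal")
# matches "Section 1" or "Sec. 2" at the start of a line
HEADING = re.compile(r"^(?:Section|Sec\.)\s+\d+", re.MULTILINE)


def match_case(original, replacement):
    """Step 1 helper: .upper() if the original was all caps, else .title()."""
    return replacement.upper() if original.isupper() else replacement.title()


def poemify(text, poems):
    title_poem, section_poems = poems[0], poems[1:]
    # step 1: swap every occurrence of the title, whatever its case
    text = re.sub(re.escape(TITLE),
                  lambda m: match_case(m.group(0), title_poem),
                  text, flags=re.IGNORECASE)
    # step 2: keep each heading, replace its body with the next poem
    matches = list(HEADING.finditer(text))
    out = [text[:matches[0].start()] if matches else text]
    # step 3: zip stops at the shorter list, so sections without a poem drop out
    for match, poem in zip(matches, section_poems):
        body = " ".join(poem.split())   # remove newlines so the words flow
        out.append(f"{match.group(0)}. {body}\n")
    return "".join(out)
```

everything before the first heading is kept as-is apart from the title swap.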

this feels very related to a project i did in jer’s class last year where i replaced “mortgage” language with “data” language in hank paulson’s 2008 announcement about the economy. python woulda helped with that/made it better. anyway, executive order results here, original here.

another thing i was working on was figuring out how to clean up the file without going through it manually. these are things i did in the interpreter. i wonder if there’s a way to say: if the space char appears 2 or more times in a row, replace it with ' '? it’d also be cool to figure out how to split on html tags so i don’t have to manually delete those. maybe this will be useful later.
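both of those can be done with the `re` module. a small sketch with hypothetical function names; the tag pattern is a rough regex, not a real html parser, so it will trip on things like a `>` inside an attribute value.

```python
import re


def collapse_spaces(text):
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", text)


def split_on_tags(html):
    """Split on anything that looks like an html tag; drop empty pieces."""
    pieces = re.split(r"<[^>]+>", html)
    return [piece.strip() for piece in pieces if piece.strip()]


print(collapse_spaces("Sec. 2.   To the    maximum extent"))
print(split_on_tags("<p>Section 1.</p><p>Sec. 2.</p>"))
```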

 

horrifyingpoem.com

we’re talking about ~list comprehensions~ in class so i’m testing this demo code on the executive order i was playing with earlier.

$ only-5-letter-words.py <EO1-clean.txt >test.txt
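the class demo isn't reproduced here, but a list comprehension that keeps only five-character words might look like this sketch (`five_letter_words` is a hypothetical name). punctuation counts as a character, so "law.," would pass as a five-letter "word".

```python
def five_letter_words(lines):
    """Keep every whitespace-separated word that is exactly 5 characters long."""
    return [word for line in lines for word in line.split() if len(word) == 5]


print(five_letter_words(["under the authority vested", "state law"]))
```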

the cut up method, uncreative writing, delusions of whiteness, et al.

the cut up method, william burroughs

into this: “You can not will spontaneity. But you can introduce the unpredictable spontaneous factor with a pair of scissors.”

and this, a great q: “HOW MANY DISCOVERIES SOUND TO KINESTHETIC?”

i’ve been wanting to try this method with windows open on my computer, like this:

uncreative writing, kenneth goldsmith

“the act of pushing language around”

“gift economies, open-source cultures…” 🙄

“Lethem’s piece is a self-reflexive, demonstrative work of unoriginal genius.” o rly?

“For them, the act of writing is literally moving language from one place to another, proclaiming that context is the new content.” ah, like financialization! the value is in the asset bundling. or something.

delusions of whiteness in the avant garde, cathy park hong

“the luxurious opinion that anyone can be ‘post-identity’”

“expired snake oil”, “masturbatory exegesis”

what does she mean by this? “in complete transcription, in total paratactic scrambling,”

“Here is how Dworkin and Goldsmith characterize Zong: “the ethical inadequacies of that legal document . . . do not prevent their détournement in the service of experimental writing.” God forbid that maudlin and heavy-handed subjects like slavery and mass slaughter overwhelm the form!”

omg omg i promise i will not c&p the entire essay, but… “To be an identity politics poet is to be anti-intellectual, without literary merit, no complexity, sentimental, manufactured, feminine, niche-focused, woefully out-of-date and therefore woefully unhip, politically light, and deadliest of all, used as bait by market forces’ calculated branding of boutique liberalism. Compare that to Marxist—and often male—poets whose difficult and rigorous poetry may formally critique neoliberalism but is never “just about class” in the way that identity politics poetry is always “just about race,” with little to no aesthetic value.”

“…say a few more panels on forgotten subaltern poetry for the next wax museum conference?” 😱

code

all of trump’s executive orders are hosted on the whitehouse.gov site like blog posts. i pulled them all down using curl and put them into individual text files (which, btw, i wonder if there is a feed of these somewhere?):

i already loved HTML, but now i love it even more because each <tag> is followed by a newline. oh right obvi this is because browsers are trying to parse things just like i am. i am a browser. not a very good browser yet :-/

i tried wc -l and wc -m to count lines and chars. i’m sure one day these commands will be useful for something?

i cleaned up one of the executive orders by deleting all the stuff at the top and bottom, leaving just the <div>s and <p>s that contained EO content. i should figure out how to do this programmatically. maybe

  1. look through file with my actual eyeballs
  2. grep parent div that i want and remove anything before it.
  3. not sure how i would find the closing </div> for that section to delete everything after…
  4. line.strip() to get everything on its own line
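for step 3, one way to find the matching `</div>` is to count nesting depth: add one for every `<div>` opened inside the target and subtract one for every `</div>`, and the target closes when the count gets back to zero. a sketch with the standard library parser; the class name "body" in the example is hypothetical (step 1, the eyeballs, still has to find the real one), and this version keeps only the text, not the tags.

```python
from html.parser import HTMLParser


class DivText(HTMLParser):
    """Collect text inside the first <div> whose class list includes `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.depth = 0       # 0 means we are outside the target div
        self.done = False    # set once the target div has closed
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth:
            self.depth += 1                  # nested div inside the target
        elif self.wanted in (dict(attrs).get("class") or "").split():
            self.depth = 1                   # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1                  # hits 0 at the matching </div>
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and data.strip():
            self.lines.append(data.strip())  # step 4: one stripped chunk per line


def extract_div(html, wanted):
    parser = DivText(wanted)
    parser.feed(html)
    return parser.lines
```

`"\n".join(extract_div(page, "body"))` would then give one chunk of text per line, with everything before and after the target div dropped.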

anyway, i did it manually for now. i used the randomizer.py code from class to shuffle the lines of the first executive order.

$ cat EO1-clean.txt | python randomizer.py > output.txt

and then, since it was HTML, i put it in an index.html file.

i tried to do a slightly different version that splits each line at “law” and removes that word, then joins everything back together, and shuffles.

but it just resulted in a thousand AttributeErrors
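the failing code isn't shown, so this is only a guess at the cause: `str.split` returns a list, and calling a string method on that list raises an AttributeError. a sketch that joins the pieces back into a string before shuffling (`remove_word_and_shuffle` is a hypothetical name, not randomizer.py):

```python
import random


def remove_word_and_shuffle(lines, word="law", seed=None):
    """Cut `word` out of every line, then shuffle the lines."""
    rng = random.Random(seed)                   # pass a seed for repeatable output
    cleaned = []
    for line in lines:
        pieces = line.rstrip("\n").split(word)  # str.split returns a list
        cleaned.append("".join(pieces))         # join the list back into a str
    rng.shuffle(cleaned)
    return cleaned


print(remove_word_and_shuffle(["the law is\n", "lawful order\n"]))
```

it removes the substring wherever it appears, so "lawful" becomes "ful".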

so but what i really want to do is a cutup of both content and HTML tags and have the css still apply to the tags. that way, i’d get divs and buttons and menus all over the place.

update: the application formerly known as ajooba

we had our first semi-official user testing session! i made a document that i thought could guide our session, but it was actually hard to stick to once we got started. i think for future user testing sessions, none of us devs/designers should be in the room because we talk too much and wanna go into great detail about what we ~intended~ with different features. we still got great feedback because we were showing the tool to people who know a ton about teaching and technology. next time will be a more formal thing, and i think we’ll write a script/list of objectives for someone else to administer.

eve took amazing notes, which we converted into a list of issues file-able on github:

she’s busy animating transitions that will hopefully make some of the information we’re conveying more intuitive to understand. i’m really excited about what she’s working on.

meanwhile, i’ve been trying to wrap my brain around the code base since i stopped keeping up with it after surya did a big awesome refactor a few months ago. the front is vue.js with vuex. the back is node.js. the back talks to the front via sockets. that’s about where i’m at right now. i made this sketch of vue files:

and am redoing the annotated file tree exercise:

/Users/jkagan/Desktop/herbivore
|-- NOTES.md
|-- README.md
|-- app
|  |-- App.vue
|  |-- components
|  |  |-- Console.vue
|  |  |-- InfoBar.vue
|  |  |-- NavMenu.vue
|  |  |-- ToolBar.vue
|  |  |-- tools
|  |  |  |-- Network.vue - frontend table for network view (columns, ability to sort, etc)
|  |  |  |-- Sniffer.vue - frontend table for sniffer view (columns, ability to sort, key code shortcuts, etc)
|  |  |  `-- SnifferPayload.vue - frontend window for payload view in sniffer tool; this is where the HTTP vs HTTPS text lives
|  |  `-- viz
|  |     |-- Grid.js
|  |     |-- Viz.vue - for testing; deleted
|  |     |-- VizTree.vue - vue template for visualization area
|  |     `-- VizTreeStyleParams.js
|  |-- filters
|  |  `-- index.js
|  |-- main.js - everything that's in the app div, rendered in index.html
|  `-- store
|     |-- modules
|     |  |-- sniffer.js - includes actions, getters, mutations re: newPacket, clearSnifferInfo
|     |  |-- toolbar.js - includes actions, getters, mutations re: currentTool, currentView, toolRunning, toolNames, clearToolbarInfo
|     |  `-- network-info.js
|     |-- actions.js
|     |-- getters.js
|     |-- index.js
|     `-- mutation-types.js - this is basically a list of all the ways to commit new data to the vuex store. this is how state is updated in network view, sniffer view, and toolbar side panel. all these mutation types are used in the 'modules' files: sniffer.js, toolbar.js, network-info.js
|-- assets
|  `-- imgs
|     `-- play.png
|-- dist
|  |-- build.js
|  `-- public
|     `-- fonts
|        |-- photon-entypo.eot
|        |-- photon-entypo.ttf
|        `-- photon-entypo.woff
|-- fonts
|-- herbivore-darwin-x64 - binary
|-- index.html - thing that renders app div from app/main.js, build.js script
|-- main.js - first thing to launch. sets new browser window, sets sudo permissions for sniffing, sets sockets that talk to vue.js and network scripts
|-- network-scripts
|  |-- Network.js - network constructor function; outputs to terminal. to init, _getHostNamePromise, _getHostName, _scanArpTable, _pingSubnet, _getAllHostnames, _checkHost, _getHostBuffer, cmd, start, stop
|  |-- Sniffer.js
|  |-- ToolManager.js
|  |-- old
|  |  |-- pcap-parser.js
|  |  `-- tcp-test.js
|  `-- pcap-filters.js
|-- package.json
|-- styles - photon styling, plus old scss stuff
|  |-- main.scss
|  `-- photon.css
`-- webpack.config.js - webpack builds the dist/build.js file that's rendered in index.html