stuff for scraping executive orders

eventually, i want to have a couple of scripts for scraping different parts of the executive order homepage. for now, just these two.

this one gets titles and hrefs (to be appended to a base url in a separate program):

python eoTitleScrape.py 0 > out.txt
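the script itself isn't pasted here, but the gist is something like this (the listing url, the page argument, the requests + beautifulsoup combo, and the way the links get picked out are all guesses on my part):

import sys

import requests
from bs4 import BeautifulSoup

# placeholder listing url; the real script's url was probably different
LIST_URL = "https://www.whitehouse.gov/presidential-actions/page/{}/"

page = sys.argv[1]  # e.g. `python eoTitleScrape.py 0`
soup = BeautifulSoup(requests.get(LIST_URL.format(page)).text, "html.parser")

# print each title and its (relative) href; a separate program joins the
# href onto a base url
for a in soup.find_all("a", href=True):
    title = a.get_text(strip=True)
    if "Executive Order" in title or "Order of Succession" in title:
        print(title)
        print(a["href"])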

output from running the script twice:

Presidential Executive Order on a Comprehensive Plan for Reorganizing the Executive Branch
Executive Order Protecting The Nation From Foreign Terrorist Entry Into The United States
Presidential Executive Order on The White House Initiative to Promote Excellence and Innovation at Historically Black Colleges and Universities
Presidential Executive Order on Restoring the Rule of Law, Federalism, and Economic Growth by Reviewing the “Waters of the United States” Rule
Presidential Executive Order on Enforcing the Regulatory Reform Agenda
Providing an Order of Succession Within the Department of Justice
Presidential Executive Order on Enforcing Federal Law with Respect to Transnational Criminal Organizations and Preventing International Trafficking
Presidential Executive Order on Preventing Violence Against Federal, State, Tribal, and Local Law Enforcement Officers
Presidential Executive Order on a Task Force on Crime Reduction and Public Safety
Presidential Executive Order on Core Principles for Regulating the United States Financial System

Presidential Executive Order on Reducing Regulation and Controlling Regulatory Costs
Executive Order: ETHICS COMMITMENTS BY EXECUTIVE BRANCH APPOINTEES
EXECUTIVE ORDER: PROTECTING THE NATION FROM FOREIGN TERRORIST ENTRY INTO THE UNITED STATES
Executive Order: Border Security and Immigration Enforcement Improvements
Executive Order: Enhancing Public Safety in the Interior of the United States
Executive Order Expediting Environmental Reviews and Approvals For High Priority Infrastructure Projects
Executive Order Minimizing the Economic Burden of the Patient Protection and Affordable Care Act Pending Repeal

this one gets lines of executive orders:
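that one isn't pasted either; a minimal sketch, assuming it takes the full url of one executive order and prints the body text a line at a time (which tags hold the body is a guess):

import sys

import requests
from bs4 import BeautifulSoup

# sketch: print the text of a single executive order, one <p> per line
url = sys.argv[1]
soup = BeautifulSoup(requests.get(url).text, "html.parser")

for p in soup.find_all("p"):
    line = p.get_text(strip=True)
    if line:
        print(line)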

now i wanna do some pattern analysis.

the adl is an israel advocacy org.

the adl has become more palatable to leftier people since their bonkers ED, abe foxman, left in 2015. but the way they dress up israel advocacy work as general “anti-hate” work is really problematic, especially when they use the anti-hate banner to lump palestine solidarity activists in with, like, richard spencer.

where is some of this lumping happening? in the blog tags!

i wrote this thing to get blog posts off the old adl website. when i tried with just urllib, i got a weird javascript error, which, upon googling, i learned had to do with a script some websites have that says “don’t load anything if the requests aren’t coming from an actual browser.” the workaround appears to be using a headless browser. i think what’s happening below is that phantomjs is the actual headless browser, and selenium is the python wrapper that drives it, which is why i need both.

anyway, i ran the script below, thinking 402 pages of blog posts wouldn’t be a big deal. 402 pages of blog posts is, in fact, a big deal. so i changed the params to only grab pages 360 to 402.
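roughly what that script is doing, with the page range already narrowed (the blog's url pattern here is a placeholder, not necessarily the real one):

import time

from selenium import webdriver

# phantomjs is the headless browser itself; selenium is the python wrapper
# that drives it
driver = webdriver.PhantomJS()

with open("adl-posts-pages-360-402.txt", "w") as out:
    for page in range(360, 403):
        # placeholder url pattern for the paginated blog listing
        driver.get("http://blog.adl.org/page/{}".format(page))
        time.sleep(2)  # let the page's javascript finish before grabbing it
        out.write(driver.page_source + "\n")

driver.quit()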

cool! i grepped ‘students’ but then thought ‘tags’ were more interesting.

grep 'tags' -i adl-posts-pages-360-402.txt > tags2.txt

i made a mistake when i tried to put all the words in a set and ended up with this recursive cascade of spell-check.

but this worked:

cat tags.txt | python toSet.py
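toSet.py isn't pasted here either; a sketch of a counter like it, where the way tags get pulled out of each grepped html line is my guess:

import re
import sys
from collections import Counter

# read the grepped 'tags' lines from stdin and count each tag string.
# assumption: the tags show up as link text (between > and </a>) in the html.
counts = Counter()
for line in sys.stdin:
    for tag in re.findall(r">([^<>]+)</a>", line):
        counts[tag.strip()] += 1

for tag, n in counts.items():
    print("{}: {}".format(tag, n))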

output:

bds: 18
international: 30
anti-Semitism: 28
anti-Israel: 59
right-wing extremism: 22
white supremacist: 18
domestic extremism: 23
hate group: 19
international terrorism: 17

ugh. fuck that.

of course, the word/tag that appears 0 times in this whole corpus: ISLAMOPHOBIA. so an acrostic-ish. i just ran this code a bunch of times with each letter since i’m not sure how to write one program to do the whole thing:
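(that snippet isn't pasted here. for what it's worth, one program could do the whole thing by looping over the letters; in this sketch the corpus file and the rule for picking a line, any unused line containing a word that starts with the letter, are both guesses at what the per-letter code did:)

import random

# one-pass acrostic sketch: for each letter of the word, pick a line from
# the corpus containing a word that starts with that letter
WORD = "ISLAMOPHOBIA"

with open("adl-posts-pages-360-402.txt") as f:
    lines = [line.strip() for line in f if line.strip()]

used = set()
for letter in WORD:
    candidates = [
        line for line in lines
        if line not in used
        and any(w.lower().startswith(letter.lower()) for w in line.split())
    ]
    if candidates:
        pick = random.choice(candidates)
        used.add(pick)
        print(pick)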

example output per letter before i added some additional filters:

the “poem”:

But I know who writes your pay checks.”

which maligns and debases the Jewish State.

ADL has repeatedly urged the UN to defund and disband it.

A rundown of the call:

Efforts by Methodist anti-Israel activists to divest from Israel began in 2008

One of the activists argued that these demolitions are made possible by U.S

The boycott effort was spearheaded by a Dubai-based Palestinian novelist named Huzama Habayeb

Jewish Voice for Peace Promotes Anti-Israel Hagaddah

One of the men,

The event hosted a former Israel Defense Forces (IDF) soldier, Sergeant Benjamin Anthony.

The Muslim Student Union at UC Irvine is at it again.

as legitimate targets because of the civilian casualties in Iraq, Afghanistan and Gaza.


weird thing:

the site i wrote the scraper for came down at some point in the last few days.

mysteries. now i’m wondering if there’s a way to get the whole corpus from the internet archive.
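there probably is: the wayback machine has a cdx api that lists every snapshot it holds for a url, so something like this could rebuild the corpus (the blog url here is a placeholder):

import requests

# ask the internet archive's cdx api for snapshots of the old blog
rows = requests.get(
    "http://web.archive.org/cdx/search/cdx",
    params={"url": "blog.adl.org/*", "output": "json", "filter": "statuscode:200"},
).json()

# first row is the header; each remaining row has a timestamp and the original url
for row in rows[1:]:
    timestamp, original = row[1], row[2]
    snapshot = "http://web.archive.org/web/{}/{}".format(timestamp, original)
    print(snapshot)  # requests.get(snapshot) would pull the archived page down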

ui issue for herbivore

surya’s been helping me wrap my brain around vue/vuex so i can do more coding work on herbivore. i filed what i thought would be an easy issue in order to get started:

in my head, i was like, “oh, change the css :hover styling and call it a day.” did not take me long to discover that: no. you see, we’re changing the ui dynamically based on information we’re getting from the network. vue and vuex let you manage all this information by setting the starting state and then defining getters (computed reads of that state), actions (the things you dispatch when stuff happens), and mutations (the only things allowed to actually change the state, committed by those actions). what follows is a walk-through of my process, written in present tense even though it’s already done because tense is confusing sometimes. here we go.


python: counting words in the first executive order

cat EO1-clean.txt | python wordcount.py
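wordcount.py is more or less this (a sketch; the real one might format things differently):

import sys
from collections import Counter

# count every word coming in on stdin and print word / count pairs
counts = Counter(sys.stdin.read().split())

for word, n in counts.items():
    print(word, n)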

then, add this to get it in a pretty format:
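(the addition isn't pasted here; my guess is something like sorting by count and padding the words so the numbers line up:)

import sys
from collections import Counter

counts = Counter(sys.stdin.read().split())
width = max(len(word) for word in counts)

# highest counts first, words left-justified so the numbers form a column
for word, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(word.ljust(width), n)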


🐍🐍 python, week 5 🐍🐍

i used code from here to put all the words from the first executive order into a set, which is a way to get all the unique words in the document.
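(the linked code isn't reproduced here, but the gist is just building a set from the split-up text and printing it, one word per line:)

# read the cleaned executive order and print each unique "word" on its own line.
# splitting on whitespace keeps punctuation attached, which is why things like
# "HOUSE," and "affect:" show up below.
with open("EO1-clean.txt") as f:
    words = set(f.read().split())

for word in words:
    print(word)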

which produced “words” separated by line breaks. here are some interesting sections:

all
United
burden

PENDING
out
purchasers
Patient
for
enforceable
availability
HOUSE,

health
imperative
Nothing
benefit

Human
repeal,
or
otherwise
individuals,
control
Constitution

unwarranted
fiscal
head

with
legislative
Procedure
CARE
me
commerce
agency

Act
authorities
such
WHITE
law

affect:
impair
does
In
the
insurers,

okay so obviously some of these would be neat poems, so i tried to join them:

hmmmm noooo…

hmmmm nooooooo… okay, new activity: replacing the executive order with these poems.

i’m doing this manually for now since it would involve a bunch of regex, but i’ll record the steps here (a rough sketch of that regex version comes after the list):

  1. replace all instances of “Minimizing the Economic Burden of the Patient Protection and Affordable Care Act Pending Repeal” with the first poem above, “all United burden,” in the style in which the original text appears (so, with .title() or .upper())
  2. when sections begin, keep the text naming the section as such (“Section 1”, “Sec. 2”, etc.) but replace the body of the section with the next poem above. remove newlines from the poems above so the words flow like sentences, but don’t change case, punctuation, etc.
  3. fill sections for as many poems as were originally picked out from the set. delete sections that don’t have an accompanying poem.
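a rough, untested sketch of what the regex version of steps 1 and 2 might look like (the poem text, file name, and section-heading pattern are all assumptions):

import re

TITLE = ("Minimizing the Economic Burden of the Patient Protection "
         "and Affordable Care Act Pending Repeal")
first_poem = "all United burden"

with open("EO1-clean.txt") as f:
    text = f.read()

# step 1: swap every occurrence of the title for the first poem, matching
# whatever case the original occurrence used
def swap_title(match):
    if match.group(0).isupper():
        return first_poem.upper()
    return first_poem.title()

text = re.sub(re.escape(TITLE), swap_title, text, flags=re.IGNORECASE)

# step 2: find the section headings and keep them, so the bodies in between
# can be swapped for the remaining poems
section_pattern = re.compile(r"(Section \d+\.|Sec\. \d+\.)")
parts = section_pattern.split(text)  # headings survive as their own items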

this feels very related to a project i did in jer’s class last year where i replaced “mortgage” language with “data” language in hank paulson’s 2008 announcement about the economy. python woulda helped with that/made it better. anyway, executive order results here, original here.

another thing i was working on was figuring out how to clean up the file without going through it manually. these are things i did in the interpreter. i wonder if there’s a way to say: if the space char appears 2 or more times in a row, replace the run with a single ' '? it’d also be cool to figure out how to split on html tags so i don’t have to delete those manually. maybe this will be useful later.
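turns out both of those are one-liners with re (a sketch; the file name is a placeholder):

import re

text = open("EO1.txt").read()  # placeholder file name

# collapse any run of 2+ spaces down to a single space
text = re.sub(r" {2,}", " ", text)

# strip anything that looks like an html tag, so it doesn't have to be
# deleted by hand (crude, but fine for this kind of cleanup)
text = re.sub(r"<[^>]+>", "", text)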