Scraping Slack & Enriching Data (NL)

A more graphic version of this article is published on SourceCon : https://www.sourcecon.com/scraping-slack/

Close your eyes. Imagine. Amsterdam. City of Sourcers.

The canals, nice bars & night life. Summer heat.

We’re on a mission seeking Drupal experts in the Netherlands using Slack.

Relax. Go with the flow.

I’ll show you a two-step method to find full names, emails and LinkedIn profiles of Slack users.

1st part: Scraping Slack

2nd part: Finding the associated LinkedIn profile

PART 1

Let’s start by joining: “Drupal Nederland” (drupalnl.slack.com).

We automatically become member of a general channel: #drupalnl (477 members).

Looking a bit more, notice these two channels #evenementen and #front-end.

Assumption: people who subscribe to all 3 channels are active users i.e the ones we’re looking for.

(There can be many different assumptions. They’ll change only the dataset, not the process. We‘ll talk about it later on.)

Now let’s log into LinkedIn and connect PhantomBuster (works perfectly with the free version).

Visit the site and in the Phantom Section, select the “Slack Channel User Extractor”.

For this phantom to work, we will need two things:

  1. The slack workspace URL
  2. A publicly shared google sheet listing the channels we want to scrape.

The slack group URL can be found on slack itself.

Let’s now create the Google Sheet: sheets.new

Insert the channels we want to scrape in the first column.

For PhantomBuster to be able to access it; we need to publicly share the file.

Back to our Phantom’s Setup:

First, click to grab the slack’s session cookie. Insert the Slack Workspace URL.

Finally, paste the google sheet’s link in Phantom Buster.

Save.

Choose to be notified in case of error & Save again.

Now, press launch!

Be patient.

This is what it looks like after scraping successfully. The nuggets are in “result.csv”.

Open a new google sheet (sheets.new) and import “result.csv”.

And…Boom 💥

Awesome! We made it.

Now. Let’s take a step back for a second. Back to assumptions.
Take a closer look at the outcome, notice the “lastUpdate” column.
It shows the person’s last connection to the slack.

We could assume that the people connected in the last month are the most active. We should then scrape the whole group and select only the people active in the last month.
Combinations are endless. You can try them all. The process stays the same.

Now back to business.

Turning on Spotify.

Let’s find out who is registered on the 3 channels and clean the data.

Scrolling down we see that there is one line per person in each channel.

We’ll use this to count the occurrences.

For that we need a unique identifier: the email address.

COUNTIF does the job.

Drag down and it’s done.

We now know who the people in 3 channels are.

To have a clean list, let’s remove duplicates.

But before, copy/paste the values of our indicator (occurrences) in a new column.
Why? Because once we remove duplicates, the occurrences will logically drop to 1 for everyone.

Remove duplicates (choose the row containing emails).

Another COUNTIF tells us how many people subscribed to 1, 2 or 3 channels.

Let’s do a Checksum to check everything is correct.

Knowing #drupalnl is the biggest channel with 477 people, it should add up.

One person is missing!
After a quick check, we discover Rocket.Chat (a bot) that is in 0 channels.
We’re all good!

Part 2 – Finding the LinkedIn profiles.

To do this, we’ll use another Phantom.

In the free version, we’ll need to go to our Dashboard and delete our actual Phantom.
Store & load “LinkedIn Profile URL Finder”.

Again, we’ll need a publicly shared spreadsheet.

To work, this Phantom needs Full Names & Companies.
Since we don’t have the companies, let’s tweak this.

Think about a discriminatory criterion very likely to be true.
How about: “Drupal” and “Netherlands”

CONCATENATE does the job.

Copy the sheet’s link in the Phantom Setup.

Follow the same steps as before and press Launch.

Less than a minute later it’s over.

Again: download “result.csv” & import in a google sheet.

Copy the LinkedIn column and paste it back in our google sheet.

Use FILTER to clean the data (we loose 20 people).

And there you have it !

A list of 57 people digging Drupal in the Netherlands.
Their full names, email and LinkedIn profiles.

You can use this process to scrape any Slack.

You’ll need to manually check the list for potential false positive.

Remember the quality of your assumptions will determine the quality of your outcomes.

Enjoy!

Een reactie achterlaten

Je e-mailadres zal niet getoond worden.