The Universal Solvent for REST APIs

Data scientists working in Python or R typically acquire data by way of REST APIs. Both environments provide libraries that help you make HTTP calls to REST endpoints, then transform JSON responses into dataframes. But that’s never as simple as we’d like. When you’re reading a lot of data from a REST API, you need to do it a page at a time, but pagination works differently from one API to the next. So does unpacking the resulting JSON structures. HTTP and JSON are low-level standards, and REST is a loosely-defined framework, but nothing guarantees absolute simplicity, never mind consistency across APIs.

What if there were a way of reading from APIs that abstracted all the low-level grunt work and worked the same way everywhere? Good news! That is exactly what Steampipe does. It’s a tool that translates REST API calls directly into SQL tables. Here are three examples of questions that you can ask and answer using Steampipe.



1. Twitter: What are recent tweets that mention PySpark?

Here’s a SQL query to ask that question:

select
  id,
  text
from
  twitter_search_recent
where
  query = 'pyspark'
order by
  created_at desc
limit 5;

Here’s the answer:

+---------------------+------------------------------------------------------------------------------------------------>
| id                  | text                                                                                           >
+---------------------+------------------------------------------------------------------------------------------------>
| 1526351943249154050 | @dump Tenho trabalhando bastante com Spark, mas especificamente o PySpark. Vale a pena usar um >
| 1526336147856687105 | RT @MitchellvRijkom: PySpark Tip ⚡                                                            >
|                     |                                                                                                >
|                     | When to use what StorageLevel for Cache / Persist?                                             >
|                     |                                                                                                >
|                     | StorageLevel decides how and where data should be s…                                           >
| 1526322757880848385 | Solve challenges and exceed expectations with a career as a AWS Pyspark Engineer. https://t.co/>
| 1526318637485010944 | RT @JosMiguelMoya1: #pyspark #spark #BigData curso completo de Python y Spark con PySpark      >
|                     |                                                                                                >
|                     | https://t.co/qf0gIvNmyx                                                                        >
| 1526318107228524545 | RT @money_personal: PySpark & AWS: Master Big Data With PySpark and AWS                        >
|                     | #ApacheSpark #AWSDatabases #BigData #PySpark #100DaysofCode                                    >
|                     | -> http…                                                                                       >
+---------------------+------------------------------------------------------------------------------------------------>

The table that’s being queried here, twitter_search_recent, receives the output from Twitter’s /2/tweets/search/recent endpoint and formulates it as a table with these columns. You don’t have to make an HTTP call to that API endpoint or unpack the results, you just write a SQL query that refers to the documented columns. One of those columns, query, is special: it encapsulates Twitter’s query syntax. Here, we’re just looking for tweets that match PySpark, but we could as easily refine the query by pinning it to specific users, URLs, types (is:retweet, is:reply), properties (has:mentions, has:media), and so on. That query syntax is the same no matter how you’re accessing the API: from Python, from R, or from Steampipe. It’s plenty to think about, and all you should really need to know when crafting queries to mine Twitter data.
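
For example, here’s a sketch of a refined version of the same query. It leans on two of Twitter’s documented search operators, has:links and -is:retweet, passed straight through in the query column, to narrow the results to original tweets that contain links:

-- a sketch: refine the search with Twitter's native operators
select
  id,
  text
from
  twitter_search_recent
where
  query = 'pyspark has:links -is:retweet'
order by
  created_at desc
limit 5;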

2. GitHub: What are repositories that mention PySpark?

Here’s a SQL query to ask that question:

select
  name,
  owner_login,
  stargazers_count
from
  github_search_repository
where
  query = 'pyspark'
order by stargazers_count desc
limit 10;

Here’s the answer:

+----------------------+-------------------+------------------+
| name                 | owner_login       | stargazers_count |
+----------------------+-------------------+------------------+
| SynapseML            | microsoft         | 3297             |
| spark-nlp            | JohnSnowLabs      | 2725             |
| incubator-linkis     | apache            | 2524             |
| ibis                 | ibis-project      | 1805             |
| spark-py-notebooks   | jadianes          | 1455             |
| petastorm            | uber              | 1423             |
| awesome-spark        | awesome-spark     | 1314             |
| sparkit-learn        | lensacom          | 1124             |
| sparkmagic           | jupyter-incubator | 1121             |
| data-algorithms-book | mahmoudparsian    | 1001             |
+----------------------+-------------------+------------------+

This looks just like the first example! In this case, the table that’s being queried, github_search_repository, receives the output from GitHub’s /search/repositories endpoint and formulates it as a table with these columns.
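
The query column works the same way here: it passes GitHub’s native search syntax through to the API. A sketch, assuming the stars qualifier from GitHub’s repository-search syntax is honored as documented:

-- a sketch: use a GitHub search qualifier to pre-filter by stars
select
  name,
  owner_login,
  stargazers_count
from
  github_search_repository
where
  query = 'pyspark stars:>1000'
order by stargazers_count desc
limit 10;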

In both cases the Steampipe documentation not only shows you the schemas that govern the mapped tables, it also gives examples (Twitter, GitHub) of SQL queries that use the tables in various ways.

Note that these are just two of many available tables. The Twitter API is mapped to 7 tables, and the GitHub API is mapped to 41 tables.
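
Because Steampipe is Postgres under the covers, and each plugin’s tables live in a schema named for the plugin, you can enumerate them with ordinary SQL. A sketch, assuming the Twitter plugin is installed under its default schema name:

-- a sketch: list every table the twitter plugin provides
select
  table_name
from
  information_schema.tables
where
  table_schema = 'twitter'
order by
  table_name;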

3. Twitter + GitHub: What have owners of PySpark-related repositories tweeted lately?

To answer this question we need to consult two different APIs, then join their results. That’s even harder to do, in a consistent way, when you’re reasoning over REST payloads in Python or R. But this is the kind of thing SQL was born to do. Here’s one way to ask the question in SQL.

-- find pyspark repos
with github_repos as (
  select
    name,
    owner_login,
    stargazers_count
  from
    github_search_repository
  where
    query = 'pyspark' and name ~ 'pyspark'
  order by stargazers_count desc
  limit 50
),

-- find twitter handles of repo owners
github_users as (
  select
    u.login,
    u.twitter_username
  from
    github_user u
  join
    github_repos r
  on
    r.owner_login = u.login
  where
    u.twitter_username is not null
),

-- find corresponding twitter users
twitter_userids as (
  select
    id
  from
    twitter_user t
  join
    github_users g
  on
    t.username = g.twitter_username
)

-- find tweets from those users
select
  t.author->>'username' as twitter_user,
  'https://twitter.com/' || (t.author->>'username') || '/status/' || t.id as url,
  t.text
from
  twitter_user_tweet t
join
  twitter_userids u
on
  t.user_id = u.id
where
  t.created_at > now()::date - interval '1 week'
order by
  t.author
limit 5

Here’s the answer:

+----------------+---------------------------------------------------------------+------------------------------------->
| twitter_user   | url                                                           | text                                >
+----------------+---------------------------------------------------------------+------------------------------------->
| idealoTech     | https://twitter.com/idealoTech/status/1524688985649516544     | Can you find creative soluti        >
|                |                                                               |                                     >
|                |                                                               | Join our @codility Order #API Challe>
|                |                                                               |                                     >
|                |                                                               | #idealolife #codility #php          >
| idealoTech     | https://twitter.com/idealoTech/status/1526127469706854403     | Our #ProductDiscovery team at idealo>
|                |                                                               |                                     >
|                |                                                               | Think you can solve it? ?           >
|                |                                                               | ➡️  https://t.co/ELfUfp94vB https://t>
| ioannides_alex | https://twitter.com/ioannides_alex/status/1525049398811574272 | RT @scikit_learn: scikit-learn 1.1 i>
|                |                                                               | What's new? You can check the releas>
|                |                                                               |                                     >
|                |                                                               | pip install -U…                     >
| andfanilo      | https://twitter.com/andfanilo/status/1524999923665711104      | @edelynn_belle Thanks! Sometimes it >
| andfanilo      | https://twitter.com/andfanilo/status/1523676489081712640      | @juliafmorgado Good luck on the reco>
|                |                                                               |                                     >
|                |                                                               | My advice: power through it + a dead>
|                |                                                               |                                     >
|                |                                                               | I hated my first few short videos bu>
|                |                                                               |                                     >
|                |                                                               | Looking forward to the video ?

When APIs frictionlessly become tables, you can devote all of your attention to reasoning over the abstractions represented by those APIs. Larry Wall, the creator of Perl, famously said: “Easy things should be easy, hard things should be possible.” The first two examples are things that should be, and are, easy: each is just 10 lines of simple, straightforward SQL that requires no wizardry at all.

The third example is a harder thing. It would be hard in any programming language. But SQL makes it possible in a number of nice ways. The solution is made of concise stanzas (CTEs, Common Table Expressions) that form a pipeline. Each phase of the pipeline handles one clearly-defined piece of the problem. You can validate the output of each phase before proceeding to the next. And you can do all this with the most mature and widely-used grammar for selection, filtering, and recombination of data.
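
To validate a phase, just run its body standalone. Here, for example, is the first stanza of the pipeline lifted out of its CTE, so you can eyeball the repositories it finds before joining them downstream:

-- validate phase one of the pipeline on its own
select
  name,
  owner_login,
  stargazers_count
from
  github_search_repository
where
  query = 'pyspark' and name ~ 'pyspark'
order by stargazers_count desc
limit 50;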

Do I have to use SQL?

No! If you like the idea of mapping APIs to tables, but you’d rather reason over those tables in Python or R dataframes, then Steampipe can oblige. Under the covers it’s Postgres, enhanced with foreign data wrappers that handle the API-to-table transformation. Anything that can connect to Postgres can connect to Steampipe, including SQL drivers like Python’s psycopg2 and R’s RPostgres as well as business-intelligence tools like Metabase, Tableau, and PowerBI. So you can use Steampipe to frictionlessly consume APIs into dataframes, then reason over the data in Python or R.
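
Here’s a minimal sketch of that pattern in Python. The connection settings shown (host, port, database, user) are Steampipe’s documented defaults for a locally running service; check your own with steampipe service status:

import pandas as pd
import psycopg2

# connect to Steampipe's embedded Postgres
# (defaults for a local service; adjust to match yours)
conn = psycopg2.connect(
    host="localhost", port=9193, dbname="steampipe", user="steampipe"
)

# run the same SQL you'd type in the Steampipe shell,
# landing the results directly in a dataframe
df = pd.read_sql_query(
    """
    select name, owner_login, stargazers_count
    from github_search_repository
    where query = 'pyspark'
    limit 10
    """,
    conn,
)
print(df)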

But if you haven’t used SQL in this way before, it’s worth a look. Consider this comparison of SQL to Pandas from How to rewrite your SQL queries in Pandas.

SQL:    select * from airports
Pandas: airports

SQL:    select * from airports limit 3
Pandas: airports.head(3)

SQL:    select id from airports where ident = 'KLAX'
Pandas: airports[airports.ident == 'KLAX'].id

SQL:    select distinct type from airports
Pandas: airports.type.unique()

SQL:    select * from airports where iso_region = 'US-CA' and type = 'seaplane_base'
Pandas: airports[(airports.iso_region == 'US-CA') & (airports.type == 'seaplane_base')]

SQL:    select ident, name, municipality from airports where iso_region = 'US-CA' and type = 'large_airport'
Pandas: airports[(airports.iso_region == 'US-CA') & (airports.type == 'large_airport')][['ident', 'name', 'municipality']]

We can argue the merits of one style versus the other, but there’s no question that SQL is the most universal and widely-implemented way to express these operations on data. So no, you don’t have to use SQL to its fullest potential in order to benefit from Steampipe. But you might find that you want to.