How to use the ECB API with Python
If you – like me – had a hard time understanding the ECB API from its official documentation, this guide is for you. We will walk through an example in which we query 10-year and 1-year maturity government bond yields. I will show you how to smoothly extract and plot whatever ECB dataset you need with the help of a few simple functions.
The ECB API is based on the Statistical Data and Metadata Exchange (SDMX) standard. The standard is quite painful to work with directly, which is why we will be using the sdmx1 Python library to interact with the API, together with pandas. You can find the official sdmx1 documentation here.
Setup
To get started, install both libraries if you haven’t done so already.
pip install sdmx1
pip install pandas
For the rest of the examples you will need to have the following setup in your file/notebook.
import sdmx
import pandas as pd
ecb = sdmx.Client("ECB")
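sdmx1 knows about a number of statistical data providers besides the ECB. If you ever want to point the client at a different source, you can list the IDs it supports out of the box (a small aside; the exact list depends on your installed sdmx1 version):
# Print the IDs of all data sources bundled with sdmx1; 'ECB' should be among them.
sdmx.list_sources()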
List available dataflows
To start exploring the data we first need a list of the available dataflows (in the SDMX standard, a dataflow is a collection of datasets that share the same structure). You can browse them manually on the ECB's Datasets page or query them programmatically like this:
def get_available_dataflows(client):
    # Return all dataflows offered by the client's data source as a pandas Series.
    dataflow_message = client.dataflow()
    dataflows_df = sdmx.to_pandas(dataflow_message.dataflow)
    dataflows_df.name = f"{client.source.id} Available Dataflows"
    return dataflows_df
dataflows = get_available_dataflows(ecb)
dataflows
"""
Outputs a Series with the dataflow's id as the index
and its name as the value:
AME AMECO
BKN Banknotes statistics
BLS Bank Lending Survey Statistics
BNT Shipments of Euro Banknotes Statistics (ESCB)
BOP Euro Area Balance of Payments and Internationa...
...
SUR Opinion Surveys
TGB Target Balances
TRD External Trade
WTS Trade weights
YC Financial market data - yield curve
Name: ECB Available Dataflows, Length: 80, dtype: object
"""
Understanding a dataflow
In this example we will use the ECB’s "Financial market data - yield curve" ('YC') dataflow. Now we need to gather the information required to specify which of the dataflow’s datasets we want to use. These dataflows are usually very big and the API will refuse to respond if our query is too broad. In SDMX terms, narrowing the query down like this is called slicing the dataflow.
First we need to find out how to use our dataflow, so we query the API for some metadata about it. This metadata includes the data structure definition and the constraints.
The data structure definition lists the dimensions of the dataflow. Dimensions describe the concepts behind the recorded values, such as which currency was used or at which frequency the data points were observed.
The data structure definition alone, however, does not tell us which currencies or frequencies actually occur in the data. That is what the constraints are for.
def get_dataflow_metadata(client, dataflow_id):
    # Fetch a dataflow's data structure definition, constraints and name.
    dataflow_message = client.dataflow(dataflow_id)
    return (
        dataflow_message.dataflow[dataflow_id].structure,
        dataflow_message.constraint[f"{dataflow_id}_CONSTRAINTS"].data_content_region[0],
        dataflow_message.dataflow[dataflow_id].name
    )
yc_data_structure_definition, yc_constraints, yc_name = get_dataflow_metadata(ecb, 'YC')
yc_data_structure_definition
"""
Output:
<DataStructureDefinition ECB:ECB_FMD2(1.0): Financial market data (not related to foreign exchange)>
"""
The first thing we can take a look at is which dimensions our dataflow uses. Here is how we query the full list of dimensions and store them in a pandas Series:
def get_dataflow_dimensions(data_structure_definition, dataflow_name):
    dimensions = data_structure_definition.dimensions
    return pd.Series({
        dimension.id: dimension.concept_identity.name
        for dimension in dimensions
    }, name=f"'{dataflow_name}' Dimensions")
yc_dimensions = get_dataflow_dimensions(yc_data_structure_definition, yc_name)
yc_dimensions
"""
Outputs a Series with the dimension's id as the index
and its name as the value:
FREQ Frequency
REF_AREA Reference area
CURRENCY Currency
PROVIDER_FM Financial market provider
INSTRUMENT_FM Financial market instrument
PROVIDER_FM_ID Financial market provider identifier
DATA_TYPE_FM Financial market data type
TIME_PERIOD Time period or range
Name: 'Financial market data - yield curve' Dimensions,
dtype: object
"""
Now that we know which dimensions exist, we need to figure out which values are actually used in our dataflow. These values are called codes, and they come in standardized codelists.
There are many more concepts in the standard that further define a dataflow and its properties, but we will keep it pragmatic and only look at what we need. To find out which codes our dataflow actually uses, we look at the constraints.
The raw constraint data returned by the API is a bit awkward to work with, which is why we use a few helper functions that collect, for each dimension, the codes we can use, and return them as a list of pandas Series.
def get_code_description(code, dimension):
    # Look up the human-readable name of a code in the dimension's codelist.
    codelist = dimension.local_representation.enumerated
    return codelist[code].name

def get_constraint_codes(constraints, dimension):
    try:
        codes = constraints.member[dimension.id].values
    except KeyError:
        # The dimension is not listed in the constraints, return an empty Series.
        return pd.Series(name=f"'{dimension.id}' Codes", dtype='object')
    codes_with_description = {
        code.value: get_code_description(code.value, dimension)
        for code in codes
    }
    return pd.Series(codes_with_description, name=f"'{dimension.id}' Codes")
def get_constraints_with_codes(data_structure_definition, constraints):
    dimensions = data_structure_definition.dimensions
    return [
        get_constraint_codes(constraints, dimension)
        for dimension in dimensions
    ]
yc_constraints_with_codes = get_constraints_with_codes(yc_data_structure_definition, yc_constraints)
yc_constraints_with_codes
"""
Outputs a list of Series.
The Series' index is set to the code's id and
its value to the code's name.
The Series' name itself is set to the dimension's id:
[
B Daily - businessweek
Name: 'FREQ' Codes, dtype: object,
U2 Euro area (changing composition)
Name: 'REF_AREA' Codes, dtype: object,
EUR Euro
Name: 'CURRENCY' Codes, dtype: object,
4F ECB
Name: 'PROVIDER_FM' Codes, dtype: object,
G_N_W Government bond, nominal, all issuers whose ra...
G_N_A Government bond, nominal, all issuers whose ra...
G_N_C Government bond, nominal, all issuers all rati...
Name: 'INSTRUMENT_FM' Codes, dtype: object,
...
]
"""
As you can see, the dataflow only uses the code "B", which stands for "Daily - businessweek", for the dimension "Frequency" ('FREQ').
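The constraints only list the codes that actually occur in the dataflow. If you want to see the full standardized codelist behind a dimension, you can read it off the data structure definition. Here is a small sketch that reuses get_code_description() from above (the helper name get_full_codelist is just for illustration):
def get_full_codelist(data_structure_definition, dimension_id):
    # Collect every code in the dimension's codelist, not just the constrained ones.
    for dimension in data_structure_definition.dimensions:
        if dimension.id == dimension_id:
            codelist = dimension.local_representation.enumerated
            return pd.Series(
                {code_id: get_code_description(code_id, dimension) for code_id in codelist.items},
                name=f"'{dimension_id}' Codelist",
            )
get_full_codelist(yc_data_structure_definition, 'DATA_TYPE_FM')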
Shortcut: Series Keys
If you are in a hurry and just want to see a list of all the available datasets, you can query the series keys of a dataflow. A series key is the combination of dimension codes that identifies a given dataset inside the dataflow.
Here is how we query the list:
def parse_series_key(series_key):
    # Turn a series key into a plain dict of dimension id -> code.
    return {value.id: value.value for value in series_key.values.values()}

def get_dataflow_series_keys(client, dataflow_id, dataflow_name):
    data_message = client.series_keys(dataflow_id)
    series_keys = [parse_series_key(series_key) for series_key in list(data_message)]
    df = pd.DataFrame.from_records(series_keys)
    df.name = f"'{dataflow_name}' Series Keys"
    return df
series_keys = get_dataflow_series_keys(ecb, 'YC', yc_name)
series_keys
"""
Output:
FREQ REF_AREA CURRENCY PROVIDER_FM INSTRUMENT_FM PROVIDER_FM_ID DATA_TYPE_FM
0 B U2 EUR 4F G_N_C SV_C_YM SR_25Y5M
1 B U2 EUR 4F G_N_C SV_C_YM SR_25Y6M
2 B U2 EUR 4F G_N_C SV_C_YM SR_25Y7M
3 B U2 EUR 4F G_N_C SV_C_YM SR_25Y8M
4 B U2 EUR 4F G_N_C SV_C_YM SR_25Y9M
... ... ... ... ... ... ... ...
2160 B U2 EUR 4F G_N_A SV_C_YM IF_5Y6M
2161 B U2 EUR 4F G_N_A SV_C_YM IF_5Y7M
2162 B U2 EUR 4F G_N_A SV_C_YM IF_5Y8M
2163 B U2 EUR 4F G_N_A SV_C_YM IF_5Y9M
2164 B U2 EUR 4F G_N_A SV_C_YM IF_6M
2165 rows × 7 columns
"""
This way we can directly see which datasets are available for the code we are looking for.
series_keys[series_keys.DATA_TYPE_FM == 'SR_10Y']
"""
Output:
FREQ REF_AREA CURRENCY PROVIDER_FM INSTRUMENT_FM PROVIDER_FM_ID DATA_TYPE_FM
579 B U2 EUR 4F G_N_C SV_C_YM SR_10Y
1655 B U2 EUR 4F G_N_A SV_C_YM SR_10Y
"""
Afterwards we can simply copy those values and use them in the next step, where we define the keys that slice the dataflow.
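You can of course also filter on several dimensions at once to narrow the list down to exactly the datasets we will request below (plain pandas again):
series_keys[
    series_keys.DATA_TYPE_FM.isin(['SR_10Y', 'SR_1Y'])
    & (series_keys.INSTRUMENT_FM == 'G_N_A')
]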
Querying datasets
Now that we have all the information in hand we can query the datasets we want to download. For our example we’re interested in the codes "Yield curve spot rate, 10-year maturity" ('SR_10Y') and "Yield curve spot rate, 1-year maturity" ('SR_1Y') from the "Financial market data type" ('DATA_TYPE_FM') dimension, as well as the code "Government bond, nominal, all issuers whose rating is triple A" ('G_N_A') from the "Financial market instrument" ('INSTRUMENT_FM') dimension.
The standard also allows us to optionally add a startPeriod and/or endPeriod parameter. We will limit the data to everything from 2012 onwards.
The following function will return the data as a pandas DataFrame with the time period set as the index and the columns set to the different dimension arrangements.
Important: When you query the dataflow for datasets, make sure that they all have the same frequency ('FREQ'), otherwise pandas will not be able to align the index based on the time period.
def get_column(series_key):
    # Use the dataset's title as the column label, truncated if it gets too long.
    column = series_key.attrib.TITLE.value
    return column if len(column) <= 90 else column[:90] + '...'

def get_unit(data_message):
    # Resolve the human-readable name of the unit the data is measured in.
    unit_codelist = data_message.structure.attributes.get('UNIT').local_representation.enumerated
    series_key = next(iter(data_message.data[0].series))
    return unit_codelist[series_key.attrib.UNIT.value].name.localized_default()

def get_datasets(client, dataflow_id, keys=None, startPeriod=None, endPeriod=None):
    data_message = client.data(dataflow_id, key=keys, params={'startPeriod': startPeriod, 'endPeriod': endPeriod})
    df = sdmx.to_pandas(data_message, datetime={'dim': 'TIME_PERIOD'})
    columns = [get_column(series_key) for series_key in data_message.data[0].series]
    return df, columns, get_unit(data_message)
yc_df, yc_columns, yc_unit = get_datasets(
    ecb,
    'YC',
    keys={'DATA_TYPE_FM': ['SR_10Y', 'SR_1Y'], 'INSTRUMENT_FM': ['G_N_A'], 'FREQ': 'B'},
    startPeriod='2012',
)
yc_df
"""
Output:
FREQ B
REF_AREA U2
CURRENCY EUR
PROVIDER_FM 4F
INSTRUMENT_FM G_N_A
PROVIDER_FM_ID SV_C_YM
DATA_TYPE_FM SR_10Y SR_1Y
TIME_PERIOD
2012-01-02 2.717862 0.135763
2012-01-03 2.744522 0.117707
2012-01-04 2.776691 0.103733
2012-01-05 2.784807 0.161397
2012-01-06 2.788371 0.187312
... ... ...
2023-08-08 2.523608 3.285734
2023-08-09 2.567893 3.325601
2023-08-10 2.578223 3.325811
2023-08-11 2.688460 3.349347
2023-08-14 2.692476 3.366395
2966 rows × 2 columns
"""
If you receive an error fetching the data, your request might be too big or there might simply be no data available for your key. In both cases the API responds with a 404 error.
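Since the result is an ordinary pandas DataFrame, you can also reshape it before plotting. For example, to collapse the column MultiIndex to just the data type code and compute the 10-year/1-year spread (variable names here are just for illustration):
# Keep only the DATA_TYPE_FM level of the column MultiIndex.
yc_simple = yc_df.copy()
yc_simple.columns = yc_df.columns.get_level_values('DATA_TYPE_FM')
# Spread between the 10-year and 1-year spot rates.
spread = yc_simple['SR_10Y'] - yc_simple['SR_1Y']
spread.tail()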
Plotting the data
Thanks to pandas we can plot the data with a few lines of code.
import matplotlib.pyplot as plt

yc_df.plot(title=yc_name)
plt.legend(yc_columns, loc='upper left', title=None)
plt.xlabel(None)
plt.ylabel(yc_unit)
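If you are working in a plain script rather than a notebook, call plt.show() to display the figure, or save it to disk first (the filename is just an example):
plt.savefig('yield_curves.png', dpi=150, bbox_inches='tight')
plt.show()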
Now you should know everything needed to query the ECB’s API with ease! If you want to dig deeper, sdmx1's docs are a great place to start. Otherwise, if you have any questions, feel free to reach out to me.