Tutorial 1: using a Quantum Device to extract machine-learning features
(download this tutorial here (external))
The companion library is a machine-learning tool that uses a quantum device to predict similarities and properties of graphs. In this tutorial, we will reproduce the first part of the QEK paper (external) using the library's high-level API. The goal of this tutorial is to predict toxicity properties of molecules using quantum machine learning, although the mechanisms involved are much more generic.
By the end of this notebook, you will know how to:
- Set up the import of a molecular dataset (the library supports other types of graphs, of course).
- Set up the compilation of these graphs for execution on a Quantum Device (either an emulator or a physical QPU).
- Launch the execution and extract the relevant machine-learning features.
A companion notebook will guide you through machine-learning with QEK.
If, instead of using the library's high-level API, you would rather work closer to the qubits, see the companion low-level notebook. It mirrors this notebook but uses a lower-level API that lets you experiment with different quantum pulses.
Dataset preparation
As in any machine learning task, we first need to load and prepare data. QEK can work with many types of graphs, including molecular graphs. For this tutorial, we will use the PTC-FM dataset, which contains such molecular graphs, labeled with their toxicity.
import torch_geometric.datasets as pyg_dataset
from qek.shared.retrier import PygRetrier
# Load the original PTC-FM dataset.
# We use PygRetrier to retry the download if it fails.
og_ptcfm = PygRetrier().insist(pyg_dataset.TUDataset, root="dataset", name="PTC_FM")
display("Loaded %s samples" % (len(og_ptcfm), ))
'Loaded 349 samples'
To extract machine-learning features from our dataset, we will need to configure a feature extractor. This library provides several feature extractors, which use either a physical quantum device (QPU) or one of a variety of emulators.
To configure a feature extractor, we will need to provide a compiler, whose task is to take a list of graphs, extract embeddings and compile these embeddings to sequences of pulses, the format that can be executed by either a QPU or an emulator. For this tutorial, our dataset is composed of molecule graphs following the PTC-FM conventions, so we will use the PTCFMCompiler:
import qek.data.graphs as qek_graphs
compiler = qek_graphs.PTCFMCompiler()
This library also provides compilers for a variety of other graph formats.
Creating and executing a feature extractor from an emulator
The easiest way to process a graph is to compile and execute it for an emulator. QEK is built on top of Pulser, which provides several emulators. The simplest of these emulators is the QutipEmulator, which QEK uses for the QutipExtractor:
from pathlib import Path
import qek.data.extractors as qek_extractors
# Use the Qutip Extractor.
extractor = qek_extractors.QutipExtractor(
    # Once computing is complete, data will be saved in this file.
    path=Path("saved_data.json"),
    compiler=compiler,
)
# Add the graphs using the compiler we've picked previously.
extractor.add_graphs(graphs=og_ptcfm)
# We may now compile them.
compiled = extractor.compile()
display("Compiled %s sequences" % (len(compiled), ))
'Compiled 272 sequences'
As you can see, the number of sequences compiled is lower than the number of samples loaded. Part of the difference is due to limitations within the algorithm (not all graphs can be efficiently laid out for execution on a Quantum Device), and part is due to the limitations of the emulator we target (which, at the time of this writing, is limited to ~50 qubits).
We may now run the extraction on the emulator:
# Limit the number of qubits for this run, for performance reasons.
# You can increase this value to a higher number of qubits, but this
# notebook will take longer to execute and may run out of memory.
#
# On our test computer, the practical limit is around 10 qubits.
max_qubits = 5
processed_dataset = extractor.run(max_qubits=max_qubits).processed_data
display("Extracted features from %s samples"% (len(processed_dataset), ))
'Extracted features from 40 samples'
If you wish to extract features from more samples, feel free to increase the value of max_qubits above. However, you will soon run into the limitations of a quantum emulator, and possibly crash this notebook. At this point, you have other options: you can use EmuMPSExtractor instead of QutipExtractor (it relies on a more recent emulator that features much better performance in most cases), or you can run the extraction on a physical QPU.
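A back-of-the-envelope estimate helps explain why emulators hit a wall as the number of qubits grows. The sketch below assumes a dense state-vector simulation that stores one complex128 amplitude (16 bytes) per basis state. This is an assumption made for illustration, not a description of how the emulators behind QutipExtractor or EmuMPSExtractor work internally, and the helper name `state_vector_bytes` is hypothetical.

```python
def state_vector_bytes(num_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Memory needed to hold a dense state vector of `num_qubits` qubits.

    Assumption: one complex128 amplitude (16 bytes) for each of the
    2**num_qubits basis states.
    """
    return (2 ** num_qubits) * bytes_per_amplitude


# Memory doubles with every additional qubit.
for n in (5, 10, 20, 30):
    size = state_vector_bytes(n)
    print(f"{n:>2} qubits: {size:,} bytes ({size / 2**30:.6f} GiB)")
```

Under this assumption, memory is only a lower bound on the cost: simulating the pulse sequence over time adds further work, so the practical limit may well be reached before memory runs out.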
Creating and executing a feature extractor on a physical QPU
Once you have checked that low-qubit sequences provide the results you expect on an emulator, you will generally want to move to a QPU. For this, you will need either physical access to a QPU, or an account with PASQAL Cloud (external), which provides you with remote access to QPUs built and hosted by Pasqal. In this section, we'll see how to use the latter.
If you don't have an account, just skip to the next section!
from pathlib import Path
HAVE_PASQAL_ACCOUNT = False # If you have a PASQAL Cloud account, fill in the details and set this to `True`.
if HAVE_PASQAL_ACCOUNT:
    # Use the QPU Extractor.
    extractor = qek_extractors.RemoteQPUExtractor(
        # Once computing is complete, data will be saved in this file.
        path=Path("saved_data.json"),
        compiler=compiler,
        project_id="XXXX",  # Replace this with your project id on the PASQAL Cloud
        username="XXX",  # Replace this with your username on PASQAL Cloud
        password=None,  # Replace this with your password on PASQAL Cloud or enter it on the command-line
    )
    # Add the graphs, exactly as above.
    extractor.add_graphs(graphs=og_ptcfm)
    # We may now compile, exactly as above.
    compiled = extractor.compile()
    display("Compiled %s sequences" % (len(compiled), ))
    # Launch the execution.
    extracted = extractor.run()
    display("Work enqueued with ids %s" % (extractor.batch_ids, ))
    # ...and wait for the results.
    await extracted
    processed_data = extracted.processed_data
    display("Extracted features from %s samples" % (len(processed_data), ))
As you can see, the process is essentially identical to executing with an emulator. Note that, as of this writing, the waiting line to access a QPU can be very long (typically several hours).
There are two main ways to deal with this:
- RemoteQPUExtractor can be attached to an ongoing job from batch ids, so that you can resume your work e.g. after turning off your computer.
- PASQAL Cloud offers access to high-performance hardware-based emulators, with dramatically shorter waiting lines. For instance, in the snippet above, you may replace RemoteQPUExtractor with RemoteEmuMPSExtractor to use the emu-mps emulator.
See the documentation (external) for more details.
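To resume work after a restart, the batch ids need to survive outside the notebook's memory. Here is a minimal sketch of one way to persist them, using only the standard library. The helper names, the file name and the example ids are hypothetical; how to re-attach an extractor to these ids is covered by the library documentation and is deliberately not shown here.

```python
import json
import tempfile
from pathlib import Path


def save_batch_ids(batch_ids, path: Path) -> None:
    """Write the batch ids to a JSON file (hypothetical helper)."""
    path.write_text(json.dumps(list(batch_ids)))


def load_batch_ids(path: Path) -> list:
    """Read the batch ids back, or return an empty list if nothing was saved."""
    if not path.exists():
        return []
    return json.loads(path.read_text())


# Round trip in a temporary directory, with made-up ids.
# In the notebook, you would pass `extractor.batch_ids` and a permanent path.
with tempfile.TemporaryDirectory() as tmp:
    ids_file = Path(tmp) / "batch_ids.json"
    save_batch_ids(["batch-a", "batch-b"], ids_file)
    restored = load_batch_ids(ids_file)
    missing = load_batch_ids(Path(tmp) / "does_not_exist.json")

print(restored)
```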
...or using the provided dataset
For this notebook, instead of spending hours running the simulator on your computer, we're going to cheat and load the results, which are conveniently stored in ptcfm_processed_dataset.json.
import qek.data.processed_data as qek_dataset
processed_dataset = qek_dataset.load_dataset(file_path="ptcfm_processed_dataset.json")
print(f"Size of the quantum compatible dataset = {len(processed_dataset)}")
Size of the quantum compatible dataset = 279
A look at the results
We can check the geometry for one of the samples:
dataset_example = processed_dataset[64]
dataset_example.draw_register()
dataset_example.draw_pulse()
The results of executing the embedding on the Quantum Device are in the field state_dict:
display(dataset_example.state_dict)
print(f"Total number of samples: {sum(dataset_example.state_dict.values())}")
{'00100000100': 15, '00100010010': 13, '10100100001': 7, '10000100000': 2, '10000000010': 29, '10000001010': 43, '01000000000': 20, '10000000000': 33, '10100001011': 3, '00001010001': 2, '01000001010': 9, '01000000100': 7, '00110000000': 6, '00100101010': 2, '10000000001': 13, '10010101100': 3, '01000010001': 8, '00000000000': 11, '00100000010': 21, '00100001100': 24, '01001010010': 2, '10000001001': 13, '00110001010': 15, '00101000010': 3, '00100010001': 4, '00110010010': 9, '10001001000': 4, '00100100010': 3, '00100000001': 6, '01000010010': 17, '10100001000': 8, '10110000100': 2, '10000010000': 11, '00010000000': 3, '00101001000': 2, '00100000000': 40, '00110000010': 11, '00100100011': 5, '10010000000': 17, '00100001010': 38, '10000001000': 16, '10001010011': 1, '10001010010': 5, '10000001100': 16, '10110000000': 7, '10010010010': 6, '00100001001': 19, '10010000010': 7, '00101001011': 3, '00101000100': 1, '10101001001': 5, '10100100000': 2, '10000010010': 20, '10000101011': 3, '00010101100': 5, '00000101000': 2, '00000010000': 5, '00101010000': 6, '00001001001': 1, '01001010000': 4, '00110001000': 5, '01001000011': 2, '00100100001': 2, '01000101010': 3, '10110100000': 3, '10100100010': 1, '01000100100': 4, '01100000010': 1, '10110001100': 2, '10010000100': 2, '00100001000': 18, '10110100100': 2, '00000001010': 10, '00101010010': 1, '10000000100': 15, '00000001100': 3, '10110001000': 5, '10100000011': 2, '10110010000': 5, '00000000100': 3, '10000100011': 3, '00010101000': 3, '01001001011': 1, '00101001001': 1, '10000010001': 2, '10000101001': 1, '00100010000': 7, '10100001010': 8, '10100100011': 3, '10100010010': 3, '01000001100': 9, '00110001100': 2, '10100010000': 2, '10000000011': 5, '10101000001': 3, '10100000000': 11, '00000100000': 6, '10110100010': 4, '00000101010': 2, '10101010011': 6, '00100000011': 4, '10010001010': 14, '00101000011': 2, '10010001000': 2, '00000001000': 7, '01001000000': 1, '00001010000': 2, '00110101010': 1, '01001001001': 1, 
'00100101001': 2, '10100101001': 2, '00010001000': 2, '10100001100': 5, '10000100100': 2, '10101000000': 2, '00110000100': 2, '10010010000': 2, '01000010000': 9, '10101001010': 1, '01000100000': 2, '00101010001': 1, '00001000001': 3, '01001001010': 2, '00000100011': 2, '00100001101': 1, '10000101010': 4, '10000001011': 3, '00100100000': 2, '00000001001': 1, '00100110001': 1, '00110010000': 2, '10010001100': 1, '00010100100': 3, '00000100001': 2, '01000001001': 5, '01000001000': 8, '10100000010': 2, '00000001011': 1, '10001000010': 1, '10110101000': 1, '00110100010': 5, '01000000010': 11, '00000000010': 3, '10010101000': 1, '00010100010': 2, '10100010001': 1, '10001000000': 3, '01000100010': 2, '00000010010': 2, '10101001011': 1, '10001010000': 2, '00001001000': 1, '00110100000': 4, '10010100000': 1, '00101001010': 2, '01000101000': 1, '00101010011': 1, '10100001101': 1, '01001010011': 2, '01001010001': 1, '01001101010': 1, '10100000100': 2, '01001000001': 3, '10101010001': 4, '10110011100': 1, '01000000001': 1, '00110000011': 1, '00100101011': 3, '00000100010': 1, '10000101000': 1, '00001010011': 1, '10001001010': 4, '00001001010': 1, '10010100010': 3, '01000101011': 1, '00001000011': 1, '10001001011': 2, '00101000001': 1, '10001000011': 1, '01000010011': 1, '10001001001': 1, '01001001000': 1, '10110000010': 1, '00100010011': 1, '00101001110': 1, '10001000001': 2, '00000000011': 1, '10000101100': 1, '00101000000': 1, '00100001011': 1, '10100000001': 1, '00100101000': 1, '00010001100': 1, '00010010000': 1, '01000000011': 2, '10000100010': 1, '01000010110': 1}
Total number of samples: 1000
This dictionary represents an approximation of the quantum state of the device for this graph after completion of the algorithm.
- each of the keys represents one possible state for the register (which represents the graph), with each qubit (which represents a single node) being in state 0 or 1;
- the corresponding value is the number of samples observed with this specific state of the register.
In this example, for instance, we can see that the state observed most frequently is 10000001010, with 43/1000 samples.
Note: Since Quantum Devices are inherently non-deterministic, you will probably obtain different samples if you run this on a Quantum Device instead of loading the dataset.
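Reading the most frequent state off a dictionary of this size by eye is tedious, so here is a small sketch of how it could be done in code. The dictionary below is a made-up, much smaller stand-in for `dataset_example.state_dict`; only plain Python is used.

```python
# Hypothetical stand-in for `dataset_example.state_dict` (3 qubits, 100 samples).
state_dict = {"100": 43, "001": 33, "000": 11, "101": 13}

# Total number of samples recorded for this register.
total_samples = sum(state_dict.values())

# Bitstring observed most often, together with its raw count.
most_common_state, count = max(state_dict.items(), key=lambda item: item[1])

# Empirical probability of each bitstring.
probabilities = {state: n / total_samples for state, n in state_dict.items()}

print(f"{most_common_state}: {count}/{total_samples} samples")
```

The same three expressions work unchanged on the full dictionary shown above.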
Machine-learning features
From the state dictionary, we derive the distribution of excitation, which we use as our machine-learning feature. We'll use it shortly to define our kernel.
dataset_example.draw_excitation()
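As a rough illustration of how such a feature could be derived from a state dictionary, the sketch below assumes that "distribution of excitation" means the fraction of samples in which exactly k qubits were observed in state 1, for each k. This reading is an assumption, the function name `excitation_distribution` is hypothetical, and the library's own implementation may differ.

```python
from collections import Counter


def excitation_distribution(state_dict: dict) -> list:
    """Return p, where p[k] is the fraction of samples with exactly k qubits in state 1."""
    if not state_dict:
        return []
    # All bitstrings have the same length: one character per qubit.
    num_qubits = len(next(iter(state_dict)))
    total = sum(state_dict.values())
    counts = Counter()
    for bitstring, n in state_dict.items():
        counts[bitstring.count("1")] += n
    return [counts[k] / total for k in range(num_qubits + 1)]


# Made-up 3-qubit example with 10 samples.
toy_state_dict = {"000": 2, "100": 3, "001": 1, "101": 4}
distribution = excitation_distribution(toy_state_dict)
print(distribution)
```

The result has one entry per possible number of excitations (from 0 to the number of qubits), and its entries sum to 1.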
Is this distribution of excitation connected to the toxicity of the molecule? To check this out, we'll need to perform some machine-learning engineering.
Note: Of course, you could derive features completely unrelated to the distribution of excitation.
What now?
What we have seen so far covers the use of a Quantum Device to extract machine-learning features.
For the next step, we'll see how to use these features for actual machine learning.