Do the Doodle: Predicting "Quick, Draw!" drawings based on player's first drawing stroke

EC Corro | SS Garcia | J Gonzales | CR Patalud

DATA PREPROCESSING

Dask Distributed Setup and Data Preprocessing

In [1]:
import dask.bag as db
import dask.array as da
import dask.dataframe as dd
# from dask.distributed import Client


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


pd.set_option('display.max_colwidth', None)
import warnings
warnings.filterwarnings("ignore")

The preselected doodle dataset is stored in the S3 bucket s3://bdccdoodle/drawings_csv.

The dataset used for this project totals 53.1 GiB. It is composed of 82 CSV files ranging from 278.8 MiB to 1.3 GiB.

In [5]:
%%bash
aws s3 ls s3://bdccdoodle/drawings_csv/ --human-readable --summarize
2021-01-23 12:10:19    0 Bytes 
2021-01-23 12:11:44  383.5 MiB ant.csv
2021-01-23 12:11:44  464.2 MiB apple.csv
2021-01-23 12:11:44  454.7 MiB arm.csv
2021-01-23 12:11:46  667.4 MiB asparagus.csv
2021-01-23 12:11:46  914.6 MiB banana.csv
2021-01-23 12:11:44  618.5 MiB bat.csv
2021-01-23 12:11:46  627.2 MiB bear.csv
2021-01-23 12:11:44  579.2 MiB bee.csv
2021-01-23 12:11:44  567.8 MiB bird.csv
2021-01-23 12:11:46  655.9 MiB blackberry.csv
2021-01-23 12:11:44  375.4 MiB blueberry.csv
2021-01-23 12:11:46  873.4 MiB brain.csv
2021-01-23 12:11:46  633.0 MiB broccoli.csv
2021-01-23 12:11:44  550.3 MiB butterfly.csv
2021-01-23 12:11:44  498.3 MiB carrot.csv
2021-01-23 12:11:44  565.5 MiB cat.csv
2021-01-23 12:11:46  853.5 MiB cow.csv
2021-01-23 12:11:46  718.5 MiB crab.csv
2021-01-23 12:11:46  740.1 MiB crocodile.csv
2021-01-23 12:11:46  778.5 MiB dog.csv
2021-01-23 12:11:44  535.3 MiB dolphin.csv
2021-01-23 12:11:47    1.3 GiB duck.csv
2021-01-23 12:11:44  288.5 MiB ear.csv
2021-01-23 12:11:44  365.6 MiB elbow.csv
2021-01-23 12:11:46    1.1 GiB elephant.csv
2021-01-23 12:11:44  480.2 MiB eye.csv
2021-01-23 12:11:46  644.6 MiB face.csv
2021-01-23 12:11:44  392.6 MiB feather.csv
2021-01-23 12:11:44  398.7 MiB finger.csv
2021-01-23 12:11:44  621.1 MiB fish.csv
2021-01-23 12:11:44  601.1 MiB flower.csv
2021-01-23 12:11:46  770.9 MiB frog.csv
2021-01-23 12:11:46  893.6 MiB garden.csv
2021-01-23 12:11:46  636.1 MiB giraffe.csv
2021-01-23 12:11:46  941.2 MiB goatee.csv
2021-01-23 12:11:46  640.3 MiB grapes.csv
2021-01-23 12:11:44  388.3 MiB grass.csv
2021-01-23 12:11:46  690.9 MiB hedgehog.csv
2021-01-23 12:11:46    1.1 GiB horse.csv
2021-01-23 12:11:44  576.9 MiB house plant.csv
2021-01-23 12:11:46  943.8 MiB kangaroo.csv
2021-01-23 12:11:46  649.9 MiB knee.csv
2021-01-23 12:11:44  442.2 MiB leaf.csv
2021-01-23 12:11:44  278.8 MiB leg.csv
2021-01-23 12:11:46  745.1 MiB lion.csv
2021-01-23 12:11:46  870.4 MiB lobster.csv
2021-01-23 12:11:46  761.5 MiB monkey.csv
2021-01-23 12:11:44  465.2 MiB mosquito.csv
2021-01-23 12:11:44  606.1 MiB mouse.csv
2021-01-23 12:11:44  554.6 MiB moustache.csv
2021-01-23 12:11:44  523.3 MiB mouth.csv
2021-01-23 12:11:44  401.3 MiB nail.csv
2021-01-23 12:11:44  362.4 MiB nose.csv
2021-01-23 12:11:44  563.7 MiB onion.csv
2021-01-23 12:11:46  955.1 MiB owl.csv
2021-01-23 12:11:44  598.4 MiB palm tree.csv
2021-01-23 12:11:46  696.6 MiB panda.csv
2021-01-23 12:11:46  884.0 MiB parrot.csv
2021-01-23 12:11:44  333.0 MiB peanut.csv
2021-01-23 12:11:46  649.7 MiB peas.csv
2021-01-23 12:11:47    1.2 GiB penguin.csv
2021-01-23 12:11:46  937.6 MiB pig.csv
2021-01-23 12:11:46  782.2 MiB rabbit.csv
2021-01-23 12:11:46  783.7 MiB raccoon.csv
2021-01-23 12:11:46 1003.4 MiB rhinoceros.csv
2021-01-23 12:11:46  897.1 MiB scorpion.csv
2021-01-23 12:11:44  622.6 MiB sea turtle.csv
2021-01-23 12:11:44  604.4 MiB shark.csv
2021-01-23 12:11:44  595.1 MiB sheep.csv
2021-01-23 12:11:44  577.0 MiB skull.csv
2021-01-23 12:11:44  613.2 MiB snail.csv
2021-01-23 12:11:44  463.8 MiB snake.csv
2021-01-23 12:11:46  816.6 MiB spider.csv
2021-01-23 12:11:46  890.9 MiB squirrel.csv
2021-01-23 12:11:44  428.9 MiB strawberry.csv
2021-01-23 12:11:46  710.5 MiB swan.csv
2021-01-23 12:11:46  831.7 MiB tiger.csv
2021-01-23 12:11:44  490.5 MiB toe.csv
2021-01-23 12:11:46  626.7 MiB tree.csv
2021-01-23 12:11:44  554.4 MiB watermelon.csv
2021-01-23 12:11:46  666.6 MiB whale.csv
2021-01-23 12:11:46 1017.8 MiB zebra.csv

Total Objects: 83
   Total Size: 53.1 GiB

Loading the Data

| Column name | Data type | Description |
|---|---|---|
| countrycode | String | Country code where the drawing originated |
| drawing | Array | Array of the drawing's stroke coordinates and times |
| key_id | String | Unique key ID of the drawing |
| recognized | Boolean | Whether the drawing was successfully recognized by the AI |
| timestamp | Time object | Timestamp when the object was drawn |
| word | String | The object being drawn by the participant |
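
To make the `drawing` column concrete, here is a minimal sketch of how one value is laid out. The string is a hand-made, hypothetical example (not a row from the dataset), and `json.loads` is used only for illustration; the notebook itself parses these strings with the regex helpers defined further down.

```python
import json

# Hypothetical drawing with two strokes, written in the same string format
# as the `drawing` column. Each stroke is [x list, y list, t list].
drawing = "[[[10, 20, 30], [5, 6, 7], [0, 16, 40]], [[50, 60], [8, 9], [120, 150]]]"

strokes = json.loads(drawing)            # list of strokes
first_x, first_y, first_t = strokes[0]   # unpack the first stroke

print(len(strokes))        # number of strokes in the drawing
print(first_x, first_y)    # x and y coordinates of the first stroke
print(first_t)             # time values of the first stroke
```

Within a stroke, the x, y, and t lists always have the same length, which is why the number of points in a stroke can be read from any one of them.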
In [5]:
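# Note: `error_bad_lines` was removed in pandas 2.0; on newer versions,
# pass on_bad_lines='skip' instead.
df = dd.read_csv('s3://bdccdoodle/drawings_csv/*.csv', 
                  error_bad_lines=False,
                  storage_options={'anon':True}, 
                  dtype={'drawing': 'object'})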
In [6]:
df.tail(2)
Out[6]:
countrycode drawing key_id recognized timestamp word
5880 GB [[[991, 986, 982, 978, 974, 971, 967, 963, 959, 955, 951, 946, 941, 936, 931, 927, 924, 922, 917, 911, 906, 901, 896, 891, 886, 881, 876, 870, 865, 860, 856, 855, 854, 854, 854, 854, 854, 855, 856, 857, 858, 859, 859, 859, 859, 859, 857, 854, 852, 848, 846, 841, 836, 828, 822, 817, 812, 807, 804, 803, 801, 801, 799, 799, 799, 799, 799, 799, 799, 799, 799, 799, 799, 799, 800, 800, 802, 802, 802, 802, 796, 790, 784, 779, 774, 769, 764, 759, 753, 751, 748, 746, 744, 744, 744, 742, 742, 742, 742, 742, 742, 742, 742, 742, 741, 739, 737, 732, 727, 721, 716, 711, 705, 700, 696, 691, 688, 685, 685, 685, 685, 685, 683, 683, 683, 682, 681, 681, 681, 681, 681, 680, 680, 678, 670, 663, 657, 652, 647, 642, 637, 633, 632, 631, 630, 629, 629, 629, 629, 629, 630, 632, 634, 634, 634, 632, 629, 625, 620, 615, 610, 605, 600, 596, 593, 592, 592, 592, 591, 591, 591, 591, 591, 591, 591, 594, 595, 598, 600, 600, 600, 597, 592, 587, 581, 575, 570, 569, 569, 568, 566, 563, 561, 559, 559, 559, 557, 557, 557, 557, 557, 554, 551, 547, 543, 538, 533, 528, 525, 523, 522, 522, 520, 520, 520, 520, 520, 520, 520, 520, 520, 520, 520, 521, 525, 527, 529, 530, 530, 530, 525, 520, 513, 507, 501, 496, 492, 491, 490, 488, 487, 487, 487, 492, 496, 501, 506, 512, 518, 523, 528, 534, 539, 544, 549, 555, 560, 566, 572, 577, 584, 589, 596, 602, 607, 613, 619, 626, 631, 638, 643, 648, 654, 659, 667, 674, 679, 684, 693, 699, 707, 713, 720, 726, 734, 739, 746, 754, 761, 766, 773, 779, 787, 797, 803, 808, 814, 819, 825, 830, 836, 844, 849, 854, 860, 865, 871, 876, 883, 890, 896, 901, 907, 912, 918, 926, 933, 945, 951, 959, 965, 972, 977, 983, 990, 996, 1001, 1007, 1012, 1015, 1018, 1021, 1023, 1027, 1028, 1030, 1032, 1034, 1037, 1039, 1040, 1043, 1043, 1043, 1045, 1046, 1048, 1049, 1050, 1052, 1055, 1056, 1057, 1057, 1047, 1035, 1029, 1019, 1014, 1009, 1004, 999, 994], [230, 225, 220, 215, 210, 205, 200, 195, 190, 185, 180, 179, 179, 180, 183, 188, 193, 198, 202, 206, 209, 211, 213, 214, 216, 220, 223, 
224, 226, 229, 234, 240, 247, 252, 258, 265, 270, 277, 282, 287, 293, 299, 305, 311, 316, 323, 329, 335, 341, 347, 352, 357, 359, 362, 363, 364, 364, 363, 357, 352, 347, 341, 336, 331, 324, 317, 311, 306, 301, 296, 291, 286, 280, 274, 269, 263, 258, 253, 247, 242, 239, 237, 235, 235, 235, 235, 235, 238, 244, 250, 255, 262, 267, 272, 278, 284, 291, 296, 301, 306, 313, 319, 324, 329, 334, 339, 344, 347, 349, 351, 351, 351, 351, 346, 341, 335, 330, 325, 320, 315, 309, 304, 298, 292, 285, 277, 272, 266, 261, 255, 249, 244, 239, 234, 234, 234, 236, 236, 239, 239, 243, 248, 253, 261, 267, 272, 278, 284, 290, 295, 300, 305, 310, 315, 320, 325, 330, 335, 337, 339, 339, 339, 337, 331, 324, 318, 311, 305, 298, 291, 286, 281, 276, 269, 263, 258, 251, 246, 241, 236, 231, 226, 222, 221, 221, 221, 224, 229, 235, 240, 245, 250, 256, 262, 267, 273, 278, 284, 290, 296, 303, 310, 315, 320, 325, 325, 325, 325, 320, 314, 309, 302, 297, 292, 286, 280, 275, 269, 264, 259, 254, 249, 243, 238, 233, 228, 223, 218, 213, 207, 203, 202, 202, 202, 202, 200, 195, 190, 185, 180, 175, 169, 163, 160, 155, 152, 148, 144, 139, 136, 131, 131, 128, 127, 127, 126, 125, 124, 124, 122, 122, 120, 119, 118, 117, 115, 115, 115, 113, 111, 110, 110, 108, 107, 107, 107, 107, 106, 105, 104, 103, 102, 100, 100, 100, 100, 98, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 97, 101, 104, 110, 116, 122, 127, 134, 140, 145, 150, 156, 163, 169, 174, 180, 186, 191, 198, 203, 208, 213, 219, 224, 231, 237, 242, 247, 247, 247, 247, 247, 247, 247, 247, 244, 241], [0, 152, 215, 256, 296, 360, 408, 464, 544, 624, 704, 880, 928, 1016, 1080, 1184, 1344, 1408, 1504, 1664, 1744, 1800, 1856, 1928, 1992, 2136, 2255, 2336, 2407, 2472, 2555, 2575, 2599, 2615, 2631, 2655, 2672, 2688, 2704, 2722, 2736, 2753, 2768, 2791, 2807, 2831, 2863, 2895, 2927, 2959, 2983, 3016, 3032, 3047, 3072, 3111, 3151, 3207, 3247, 3263, 3279, 3295, 
3304, 3327, 3352, 3375, 3391, 3407, 3431, 3447, 3471, 3495, 3521, 3552, 3575, 3599, 3623, 3656, 3695, 3775, 3847, 3887, 3927, 3967, 4000, 4039, 4095, 4175, 4223, 4271, 4288, 4320, 4351, 4375, 4391, 4423, 4455, 4479, 4495, 4527, 4567, 4607, 4647, 4679, 4720, 4751, 4783, 4864, 4888, 4937, 4967, 5039, 5079, 5104, 5127, 5167, 5199, 5223, 5247, 5263, 5279, 5303, 5327, 5351, 5375, 5407, 5431, 5455, 5471, 5503, 5543, 5583, 5631, 5679, 5775, 5792, 5808, 5832, 5848, 5872, 5920, 5952, 5968, 5992, 6008, 6024, 6048, 6072, 6104, 6120, 6144, 6192, 6224, 6256, 6296, 6312, 6328, 6352, 6400, 6424, 6448, 6480, 6504, 6528, 6552, 6568, 6592, 6624, 6656, 6680, 6696, 6712, 6736, 6760, 6792, 6824, 6840, 6856, 6880, 6928, 7008, 7088, 7144, 7168, 7184, 7208, 7344, 7360, 7384, 7408, 7432, 7456, 7480, 7504, 7520, 7544, 7560, 7593, 7616, 7640, 7664, 7696, 7752, 7808, 7873, 7952, 8016, 8104, 8144, 8168, 8184, 8216, 8232, 8248, 8264, 8292, 8304, 8336, 8360, 8384, 8424, 8456, 8492, 8552, 8664, 8857, 8928, 9176, 9496, 9712, 9849, 9888, 9912, 9928, 9968, 10056, 10128, 10160, 10208, 10248, 10296, 10360, 10424, 10472, 10504, 10528, 10568, 10608, 10648, 10688, 10744, 10792, 10808, 10824, 10840, 10864, 10896, 10920, 10936, 10960, 10992, 11016, 11032, 11048, 11072, 11081, 11096, 11112, 11128, 11144, 11160, 11176, 11192, 11200, 11216, 11232, 11249, 11256, 11272, 11280, 11296, 11312, 11328, 11336, 11352, 11360, 11376, 11392, 11408, 11416, 11432, 11440, 11456, 11477, 11483, 11488, 11496, 11504, 11512, 11520, 11528, 11544, 11552, 11560, 11568, 11576, 11584, 11592, 11600, 11608, 11616, 11624, 11632, 11640, 11648, 11656, 11672, 11677, 11683, 11688, 11696, 11712, 11720, 11728, 11744, 11760, 11776, 11824, 11848, 11883, 11904, 11928, 11944, 11964, 11980, 11996, 12012, 12028, 12060, 12084, 12108, 12140, 12164, 12172, 12196, 12212, 12236, 12260, 12284, 12300, 12324, 12364, 12404, 12476, 12500, 12516, 12524, 12540, 12549, 12564, 12604, 12636, 12668]]] 5422568586608640 True 2017-03-03 14:19:32.744920 zebra
5881 US [[[646, 645, 644, 644, 641, 639, 638, 636, 634, 633, 631, 629, 627, 625, 623, 622, 620, 619, 618, 617, 616, 615, 620, 627, 633, 644, 650, 672, 678, 694, 699, 704, 716, 721, 736, 741, 747, 753, 753], [237, 242, 247, 253, 259, 265, 270, 275, 280, 285, 291, 296, 301, 306, 313, 318, 323, 328, 334, 339, 347, 352, 356, 356, 356, 352, 350, 342, 341, 335, 333, 330, 326, 324, 319, 317, 315, 312, 311], [0, 52, 74, 90, 104, 114, 122, 128, 134, 142, 148, 154, 161, 166, 174, 180, 188, 198, 210, 224, 260, 288, 397, 412, 418, 429, 434, 451, 455, 469, 475, 480, 495, 499, 518, 523, 536, 561, 666]], [[651, 656, 662, 669, 681, 686, 693, 699, 706, 712, 716, 718, 721, 723, 726, 727, 729, 732, 737, 743, 751, 761, 766, 783, 788, 804, 810, 815, 826, 838, 843, 845, 846, 847, 847, 847, 847, 846, 845, 845], [236, 239, 240, 240, 232, 230, 224, 219, 210, 200, 193, 188, 182, 175, 170, 165, 160, 155, 150, 147, 147, 146, 148, 154, 155, 159, 161, 163, 169, 180, 184, 190, 195, 200, 205, 210, 215, 220, 225, 225], [1416, 1438, 1461, 1478, 1501, 1506, 1520, 1528, 1545, 1568, 1586, 1600, 1618, 1634, 1646, 1665, 1684, 1701, 1729, 1751, 1778, 1795, 1800, 1818, 1822, 1836, 1842, 1847, 1861, 1884, 1906, 1928, 1972, 1989, 2001, 2030, 2088, 2229, 2255, 2412]], [[854, 867, 875, 883, 888, 893, 899, 906, 913, 921, 933, 942, 947, 955, 960, 967, 972, 977, 985, 991, 1001, 1009, 1019, 1026, 1035, 1043, 1050, 1058, 1066, 1072, 1081, 1088, 1098, 1105, 1113, 1119, 1125, 1148, 1154, 1163, 1168, 1184, 1190, 1197, 1203, 1208, 1213, 1218, 1224, 1230, 1238, 1245, 1251, 1256, 1261, 1269, 1275, 1280, 1286, 1291, 1297, 1304, 1309, 1314, 1323, 1329, 1335, 1340, 1346, 1350, 1355, 1357, 1359, 1361, 1361, 1359, 1354, 1348, 1343, 1338, 1332, 1326, 1321, 1315, 1306, 1301, 1293, 1287, 1280, 1275, 1268, 1257, 1252, 1244, 1234, 1227, 1222, 1217, 1209, 1200, 1193, 1181, 1172, 1164, 1155, 1144, 1138, 1124, 1111, 1106, 1098, 1089, 1080, 1071, 1063, 1056, 1048, 1036, 1025, 1016, 1010, 1003, 995, 987, 981, 973, 966, 959, 954, 949, 
944, 937, 931, 923, 917, 910, 905, 900, 895, 890, 884, 879, 874, 869, 864, 859, 854, 849, 844, 837, 831, 825, 820, 815, 810, 804, 799, 794, 789, 785, 780, 775, 774, 774, 775, 775, 775, 775, 775, 776, 776, 775, 775, 774, 774, 773, 771, 768, 766, 765, 762, 761, 760, 762, 762, 762, 762, 771, 779, 787, 791, 805, 811, 819, 832, 839, 847, 855, 858, 865, 872, 876, 877, 883, 885, 888, 892, 895, 895], [229, 233, 233, 233, 233, 233, 233, 233, 233, 233, 233, 233, 233, 233, 233, 233, 233, 233, 232, 232, 232, 232, 232, 231, 231, 230, 230, 230, 229, 229, 229, 229, 229, 229, 229, 229, 229, 229, 229, 229, 230, 232, 233, 236, 237, 239, 241, 243, 246, 248, 251, 254, 255, 257, 259, 262, 265, 267, 270, 272, 274, 278, 280, 283, 288, 292, 295, 300, 307, 313, 318, 324, 330, 336, 341, 346, 348, 352, 354, 356, 354, 355, 356, 357, 358, 358, 358, 358, 358, 358, 358, 358, 358, 358, 358, 358, 357, 357, 357, 357, 357, 357, 357, 357, 357, 357, 357, 357, 356, 356, 356, 355, 354, 354, 354, 354, 354, 354, 354, 354, 354, 354, 354, 354, 354, 354, 353, 353, 353, 353, 353, 351, 351, 351, 351, 351, 351, 351, 351, 351, 350, 350, 350, 350, 349, 346, 344, 344, 343, 342, 339, 338, 336, 334, 332, 332, 331, 330, 329, 324, 320, 317, 323, 333, 339, 345, 350, 356, 362, 368, 373, 378, 383, 388, 396, 402, 411, 432, 440, 445, 456, 461, 466, 473, 479, 484, 489, 484, 478, 473, 468, 458, 453, 446, 433, 425, 416, 408, 403, 393, 381, 374, 369, 359, 354, 348, 342, 337, 337], [3019, 3155, 3163, 3169, 3173, 3177, 3181, 3185, 3189, 3193, 3199, 3202, 3204, 3208, 3211, 3215, 3217, 3219, 3222, 3224, 3229, 3233, 3237, 3240, 3245, 3249, 3253, 3257, 3262, 3266, 3269, 3274, 3278, 3282, 3286, 3290, 3296, 3311, 3316, 3323, 3330, 3345, 3352, 3357, 3364, 3370, 3372, 3377, 3383, 3388, 3394, 3400, 3402, 3407, 3411, 3416, 3421, 3425, 3428, 3433, 3438, 3445, 3450, 3454, 3461, 3468, 3474, 3482, 3490, 3499, 3510, 3519, 3529, 3538, 3554, 3584, 3606, 3624, 3635, 3643, 3659, 3665, 3670, 3675, 3681, 3686, 3691, 3695, 3700, 3703, 3708, 3713, 
3716, 3718, 3723, 3726, 3729, 3732, 3735, 3740, 3743, 3749, 3752, 3758, 3761, 3767, 3769, 3774, 3780, 3783, 3788, 3791, 3795, 3798, 3803, 3808, 3811, 3817, 3822, 3829, 3834, 3837, 3842, 3848, 3853, 3859, 3865, 3870, 3876, 3879, 3887, 3897, 3902, 3910, 3920, 3929, 3935, 3943, 3949, 3958, 3967, 3978, 3989, 3999, 4012, 4027, 4042, 4059, 4076, 4089, 4102, 4116, 4125, 4132, 4145, 4157, 4171, 4186, 4206, 4527, 4571, 4617, 4756, 4770, 4777, 4787, 4795, 4803, 4811, 4822, 4828, 4832, 4835, 4838, 4845, 4849, 4855, 4868, 4876, 4884, 4895, 4903, 4919, 4939, 4952, 4973, 5015, 5072, 5080, 5086, 5090, 5099, 5103, 5106, 5114, 5118, 5124, 5135, 5138, 5147, 5161, 5169, 5173, 5185, 5194, 5206, 5225, 5254, 5375]], [[1180, 1185, 1190, 1195, 1202, 1207, 1218, 1225, 1232, 1248, 1253, 1260, 1265, 1297, 1304, 1316, 1324, 1332, 1347, 1355, 1363, 1380, 1387, 1395, 1409, 1414, 1421, 1433, 1439, 1448, 1453, 1458, 1460, 1460, 1459, 1456, 1455, 1453, 1449, 1446, 1439, 1437, 1433, 1431, 1428, 1423, 1418, 1411, 1405, 1401, 1395, 1389, 1384, 1379, 1377], [351, 354, 359, 361, 367, 370, 378, 383, 388, 399, 403, 408, 411, 431, 435, 440, 446, 449, 457, 462, 465, 475, 480, 485, 493, 498, 501, 508, 510, 512, 513, 511, 504, 493, 487, 480, 474, 467, 459, 451, 433, 428, 419, 414, 409, 400, 392, 382, 374, 369, 362, 356, 352, 349, 351], [5677, 5689, 5700, 5705, 5714, 5717, 5728, 5732, 5735, 5745, 5749, 5751, 5754, 5769, 5773, 5780, 5783, 5787, 5795, 5799, 5802, 5811, 5816, 5819, 5828, 5832, 5835, 5845, 5852, 5861, 5872, 5899, 5920, 5936, 5944, 5948, 5955, 5961, 5966, 5972, 5986, 5991, 5996, 6000, 6004, 6012, 6020, 6028, 6037, 6040, 6049, 6056, 6065, 6130, 6152]], [[1018, 1013, 1008, 1003, 997, 992, 986, 981, 977, 972, 968, 965, 957, 953, 950, 946, 943, 940, 936, 936], [234, 236, 240, 246, 253, 259, 264, 270, 275, 283, 288, 293, 303, 308, 313, 320, 325, 331, 336, 336], [6910, 6968, 6996, 7016, 7033, 7045, 7053, 7062, 7070, 7081, 7093, 7103, 7120, 7128, 7137, 7146, 7158, 7170, 7251, 7266]], [[1099, 1099, 1098, 
1097, 1095, 1093, 1090, 1088, 1087, 1085, 1083, 1081, 1078, 1077, 1075, 1073, 1072, 1070, 1070], [242, 249, 256, 263, 269, 278, 285, 292, 297, 302, 310, 317, 324, 330, 335, 340, 345, 350, 352], [7687, 7756, 7769, 7777, 7786, 7793, 7802, 7810, 7815, 7819, 7827, 7835, 7843, 7852, 7859, 7872, 7888, 7910, 7995]], [[1158, 1161, 1165, 1174, 1176, 1187, 1192, 1195, 1199, 1205, 1211, 1217, 1224, 1226], [239, 247, 256, 275, 280, 304, 314, 319, 326, 334, 341, 346, 343, 340], [8342, 8382, 8392, 8412, 8417, 8437, 8447, 8452, 8460, 8469, 8480, 8504, 8583, 8601]], [[743, 744], [196, 196], [9359, 9481]], [[640, 640], [306, 306], [10739, 10871]], [[624, 629, 635, 640, 645, 650, 655, 660, 664, 669, 675, 682, 687, 687], [345, 345, 341, 338, 334, 330, 327, 321, 316, 313, 310, 308, 307, 307], [11364, 11436, 11488, 11513, 11529, 11548, 11567, 11590, 11622, 11674, 11713, 11733, 11743, 11752]], [[911, 911, 911, 909, 905, 903, 901, 897, 895, 892, 887, 883, 880, 877, 874, 868, 863, 862], [236, 242, 247, 254, 260, 265, 270, 275, 281, 287, 292, 297, 302, 307, 312, 318, 321, 323], [12333, 12367, 12388, 12412, 12438, 12451, 12461, 12470, 12481, 12490, 12505, 12524, 12539, 12553, 12567, 12587, 12608, 12716]], [[833, 828, 823, 817, 814, 810, 806, 803, 800, 799], [189, 191, 197, 202, 207, 212, 217, 222, 227, 227], [13159, 13281, 13307, 13327, 13342, 13366, 13389, 13415, 13441, 13551]], [[1228, 1230, 1233, 1236, 1240, 1246, 1251, 1257, 1262, 1267, 1276, 1283, 1289, 1294, 1299, 1305, 1310, 1310], [253, 263, 268, 274, 279, 282, 287, 293, 295, 299, 307, 312, 314, 316, 319, 321, 323, 323], [14246, 14267, 14279, 14295, 14312, 14333, 14346, 14357, 14363, 14369, 14389, 14400, 14405, 14411, 14419, 14432, 14452, 14591]], [[803, 797, 793, 788, 784, 779, 774, 771, 767, 763, 761, 757, 754, 750, 746, 743, 743], [223, 229, 234, 240, 245, 253, 260, 265, 270, 277, 282, 287, 292, 297, 302, 307, 307], [15702, 15749, 15768, 15784, 15795, 15807, 15823, 15836, 15852, 15879, 15892, 15905, 15916, 15926, 15941, 15984, 
16062]], [[806, 808, 811, 818, 823, 830, 836, 842, 849, 854, 860, 863], [369, 374, 379, 382, 384, 387, 389, 391, 395, 397, 401, 403], [16683, 16722, 16741, 16759, 16775, 16787, 16803, 16815, 16827, 16838, 16855, 16947]], [[1391, 1391, 1387, 1382, 1376, 1368, 1363, 1358, 1355, 1351, 1348, 1344, 1342], [380, 385, 391, 396, 402, 408, 412, 416, 422, 428, 433, 438, 440], [17608, 17657, 17685, 17703, 17719, 17740, 17762, 17774, 17789, 17808, 17820, 17837, 17988]]] 6208998388793344 True 2017-03-26 17:36:15.512250 zebra

Data Preprocessing

Functions to be used for data preprocessing

In [16]:
def get_num_strokes(data):
    """Return the number of strokes in a string-formatted drawing."""
    data = data[1:-1]
    x = data.split(']], ')
    x[-1] = x[-1][:-2]
    x = [st + ']]' for st in x]  
    return len(x)

def get_strokes(data, num_strokes=1):
    """Return num_strokes set of stroke elements x, y, t coordinates in string
    format.
    """
    data = data[1:-1]
    x = data.split(']], ')
    x[-1] = x[-1][:-2]
    x = [st + ']]' for st in x]    
    return x[:num_strokes]

def to_nparray(lst):
    """Convert string-formatted list of integers as np.array of integers."""
    # Raw string avoids invalid-escape warnings for \[ \] \s.
    lst = re.sub(r"[\[\]\s]", "", lst)
    lst = lst.split(',')
    lst = np.array([int(x) for x in lst])    
    return lst

def stroke_to_nparray(stroke):
    """Return as numpy array the list of stroke elements."""
    coord_pattern = re.compile(r"\[[0-9,\s]*\]")
    coordinates = coord_pattern.findall(stroke)
    coordinates = np.array([to_nparray(coord) for coord in coordinates])   
    return coordinates

def allstrokes_to_nparray(strokes):
    """Return as np.array the list of strokes."""
    return np.array([stroke_to_nparray(elem) for elem in strokes])

def get_final_n_strokes(data, num_strokes=1):
    """Return the completely np.array transformed strokes."""
    lst = get_strokes(data, num_strokes=num_strokes)    
    return allstrokes_to_nparray(lst)

def limit_stroke_length(data, min_len=50, max_len=70):
    """Return rows whose stroke length lies in [min_len, max_len], with x and
    y truncated to their first `min_len` points.
    """
    data1 = data.copy()
    
    # Select rows within the specified stroke length limits.
    data1 = data1.loc[(data1.stroke_length >= min_len) &
                      (data1.stroke_length <= max_len)]
    
    # Keep only the first `min_len` points of each stroke.
    data1.x = data1.x.apply(lambda x : x[:min_len])
    data1.y = data1.y.apply(lambda x : x[:min_len])
    
    return data1
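
The string-splitting approach used by the helpers above can be cross-checked on a small input. The sketch below is a condensed, hypothetical `first_stroke` function that follows the same steps (strip the outer brackets, split on `]], `, restore the closing brackets, extract the bracketed lists) and compares the result against `json.loads`. The drawing string is made up for illustration, and the JSON comparison is an assumption-free sanity check rather than part of the notebook's pipeline.

```python
import json
import re

import numpy as np


def first_stroke(drawing):
    """Parse the first stroke of a string-formatted drawing into an int array."""
    body = drawing[1:-1]                 # drop the outermost brackets
    pieces = body.split("]], ")          # "]], " separates consecutive strokes
    pieces[-1] = pieces[-1][:-2]         # the last piece still ends with "]]"
    stroke = pieces[0] + "]]"            # restore the brackets removed by split
    lists = re.findall(r"\[[0-9,\s]*\]", stroke)   # the x, y and t lists
    return np.array([[int(v) for v in re.sub(r"[\[\]\s]", "", s).split(",")]
                     for s in lists])


# Hypothetical two-stroke drawing, not taken from the dataset.
drawing = "[[[10, 20, 30], [5, 6, 7], [0, 16, 40]], [[50, 60], [8, 9], [120, 150]]]"

parsed = first_stroke(drawing)
reference = np.array(json.loads(drawing)[0])

print(parsed.shape)                       # rows are x, y, t
print(np.array_equal(parsed, reference))  # both parsers should agree
```

Row 0 of the parsed array holds the x coordinates and row 1 the y coordinates, which is the layout the later `x[0]` and `x[1]` lookups rely on.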

Get the number of strokes

In [8]:
df['stroke_num'] = df.drawing.apply(lambda x: get_num_strokes(x))

Get the first stroke

In [9]:
df['stroke1'] = df.drawing.apply(lambda x: get_final_n_strokes(x)[0])

Get only valid strokes (with x and y)

In [10]:
df = df[df['stroke1'].map(len) > 1].copy()

Add stroke length column

In [11]:
df["stroke_length"] = df.stroke1.apply(lambda x: len(x[0]))
In [12]:
stats = df.stroke_length.describe().compute()
In [13]:
mean = int(stats['mean'])
std = int(stats['std'])
print('mean: ', mean)
print('std: ', std)
mean:  86
std:  74

Separate x and y

In [14]:
df['x'] = df.stroke1.apply(lambda x: x[0])
df['y'] = df.stroke1.apply(lambda x: x[1])

Keep only rows whose first stroke has at least `mean` (86) points, and truncate x and y to their first 86 points.

In [15]:
df = limit_stroke_length(df, min_len=mean, max_len=np.inf)
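
A toy run illustrates what this filter does. The frame below is hypothetical, and the function is a compact, pandas-only restatement of `limit_stroke_length` written for this sketch; a threshold of 3 stands in for the notebook's mean of 86.

```python
import numpy as np
import pandas as pd


def limit_stroke_length(data, min_len=50, max_len=70):
    """Keep rows with min_len <= stroke_length <= max_len; truncate x and y."""
    out = data.loc[(data.stroke_length >= min_len) &
                   (data.stroke_length <= max_len)].copy()
    out["x"] = out["x"].apply(lambda v: v[:min_len])
    out["y"] = out["y"].apply(lambda v: v[:min_len])
    return out


# Hypothetical first strokes with 2, 3 and 5 points.
toy = pd.DataFrame({
    "word": ["ant", "bee", "cat"],
    "x": [list(range(2)), list(range(3)), list(range(5))],
    "y": [list(range(2)), list(range(3)), list(range(5))],
})
toy["stroke_length"] = toy["x"].apply(len)

result = limit_stroke_length(toy, min_len=3, max_len=np.inf)

print(result[["word", "stroke_length"]])   # the 2-point stroke is dropped
print(result["x"].tolist())                # remaining strokes cut to 3 points
```

Note that `stroke_length` keeps the original point count, while `x` and `y` are cut to a common length. That common length is what allows each coordinate to become its own column in the next step.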

Expand each x and y coordinate into its own column and apply final formatting

In [17]:
# create a copy of the main df
df1 = df.copy()
In [18]:
# drop unnecessary columns
df1 = df1.drop(['drawing', 'key_id', 'timestamp', 'stroke1'],
               axis=1).dropna().reset_index(drop=True)
In [19]:
df1.columns
Out[19]:
Index(['countrycode', 'recognized', 'word', 'stroke_num', 'stroke_length', 'x',
       'y'],
      dtype='object')
In [21]:
df1 = df1.compute()
In [214]:
# Separate x stroke points
x_stroke = df1.x.apply(pd.Series).add_prefix('x_')
In [215]:
df1 = pd.merge(df1, x_stroke, left_index=True, right_index=True)
In [217]:
# Separate y stroke points
y_stroke = df1.y.apply(pd.Series).add_prefix('y_')
In [218]:
df1 = pd.merge(df1, y_stroke, left_index=True, right_index=True)
In [220]:
# Drop x, y columns
df1 = df1.drop(['x', 'y'], axis=1).dropna().reset_index(drop=True)
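
The expansion of list-valued columns into one column per point can be sketched on a toy frame. This example uses `pd.DataFrame(column.tolist())`, a swapped-in alternative to the `apply(pd.Series)` calls in the notebook that is typically faster on large frames; the data and the 3-point stroke length are hypothetical.

```python
import pandas as pd

# Hypothetical frame: two drawings whose first strokes were cut to 3 points.
toy = pd.DataFrame({
    "word": ["ant", "zebra"],
    "x": [[970, 965, 960], [991, 986, 982]],
    "y": [[477, 476, 475], [230, 225, 220]],
})

# One column per point; reusing the index keeps rows aligned on concat.
x_wide = pd.DataFrame(toy["x"].tolist(), index=toy.index).add_prefix("x_")
y_wide = pd.DataFrame(toy["y"].tolist(), index=toy.index).add_prefix("y_")

wide = pd.concat([toy.drop(columns=["x", "y"]), x_wide, y_wide], axis=1)

print(wide.columns.tolist())
print(wide)
```

With 86 points per axis and 5 remaining metadata columns, the same layout yields 5 + 2 × 86 = 177 columns, which matches the shape reported for the final DataFrame.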
In [221]:
# Convert `recognized` values to integer.
df1.recognized = df1.recognized.apply(lambda x : int(x))
In [222]:
# Final pd DataFrame
df1
Out[222]:
countrycode recognized word stroke_num stroke_length x_0 x_1 x_2 x_3 x_4 ... y_76 y_77 y_78 y_79 y_80 y_81 y_82 y_83 y_84 y_85
0 US 1 ant 11 106 970 965 960 955 950 ... 477 476 475 472 469 466 463 460 455 450
1 IT 0 ant 2 682 1460 1456 1453 1449 1446 ... 1609 1621 1634 1643 1656 1673 1690 1707 1724 1737
2 US 1 ant 7 95 344 339 332 327 321 ... 193 193 193 193 193 193 193 193 193 195
3 ID 1 ant 12 299 707 704 699 694 689 ... 292 287 282 277 276 276 280 285 291 296
4 PT 1 ant 8 86 898 893 885 880 871 ... 350 350 350 350 350 350 350 350 350 350
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3853934 GB 1 zebra 6 200 369 358 352 344 339 ... 385 380 372 361 352 344 336 333 338 346
3853935 RU 1 zebra 22 264 383 389 395 401 410 ... 489 489 489 490 491 491 491 491 491 491
3853936 NL 0 zebra 14 262 490 485 480 473 467 ... 391 401 411 419 425 434 438 438 438 438
3853937 US 1 zebra 7 333 296 289 284 276 268 ... 483 474 462 450 436 422 407 393 378 359
3853938 GB 1 zebra 1 363 991 986 982 978 974 ... 258 253 247 242 239 237 235 235 235 235

3853939 rows × 177 columns

Save the Preprocessed Dask DataFrame to the Allocated S3 Bucket

Partition the Dask DataFrame to split the data across several files

In [228]:
# Convert pd DataFrame to Dask DataFrame with Repartition
ddf = dd.from_pandas(df1, npartitions=4)

Save the Dask DataFrame as CSV files in the allocated S3 bucket

In [230]:
ddf.to_csv('s3://bdcc-doodle/draw_processed_csv/draw_processed_*.csv', 
           index=False)
Out[230]:
['bdcc-doodle/draw_processed_csv/draw_processed_0.csv',
 'bdcc-doodle/draw_processed_csv/draw_processed_1.csv',
 'bdcc-doodle/draw_processed_csv/draw_processed_2.csv',
 'bdcc-doodle/draw_processed_csv/draw_processed_3.csv']
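
As the listing above shows, the `*` in the target path is replaced by the partition number, giving one file per partition. The sketch below is a pandas-only analogy of that outcome, not Dask's implementation: it splits a hypothetical toy frame into four contiguous, nearly equal row groups and writes each to a temporary local directory instead of S3.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical stand-in for the processed DataFrame.
toy = pd.DataFrame({"word": list("abcdefghij"), "x_0": range(10)})

out_dir = Path(tempfile.mkdtemp())   # local temp dir used in place of S3
n_parts = 4
written = []

# Split row positions into n_parts contiguous, nearly equal groups.
for i, rows in enumerate(np.array_split(np.arange(len(toy)), n_parts)):
    path = out_dir / f"draw_processed_{i}.csv"
    toy.iloc[rows].to_csv(path, index=False)
    written.append(path.name)

print(written)
```

Splitting into groups of similar size is consistent with the four output files in the notebook being nearly equal in size.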

Verify that the processed data was saved to the S3 bucket s3://bdcc-doodle/draw_processed_csv/

In [231]:
%%bash
aws s3 ls s3://bdcc-doodle/draw_processed_csv/ --human-readable --summarize
2021-01-25 10:33:01  654.7 MiB draw_processed_0.csv
2021-01-25 10:33:01  654.1 MiB draw_processed_1.csv
2021-01-25 10:33:01  654.3 MiB draw_processed_2.csv
2021-01-25 10:33:01  654.8 MiB draw_processed_3.csv

Total Objects: 4
   Total Size: 2.6 GiB