Data Preprocessing

이-프 2023. 5. 9. 14:48

Download the Text in the Wild subset of the AI Hub Korean font image dataset.

import json
with open('./textinthewild_data_info.json', 'rt', encoding='UTF8') as fp:
    file = json.load(fp)  # `file` holds the parsed dataset dict, not the file handle
file.keys() #dict_keys(['info', 'images', 'annotations', 'licenses'])
file['info'] #{'name': 'Text in the wild Dataset', 'date_created': '2019-10-14 04:31:48'}
type(file['images']) #list

file['images'][0]['type'] == 'books' # True
goods = [f for f in file['images'] if f['type']=='product']
len(goods) #26340

annotation = [a for a in file['annotations'] if a['image_id'] == goods[0]['id'] and a['attributes']['class']=='word']
annotation

import matplotlib.pyplot as plt
img = plt.imread('/data/nengcipe/dataset/Goods/'+goods[0]['file_name'])
plt.imshow(img)
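
The list comprehension above scans every annotation to find the ones for a single image. If the same lookup is needed for many images, grouping the annotations once may be cheaper. A minimal sketch, assuming only the `image_id` and `attributes.class` keys used above; the helper name `index_word_annotations` and the sample records (including the non-word class value) are hypothetical:

```python
from collections import defaultdict

def index_word_annotations(annotations):
    """Group 'word'-class annotations by their image_id in a single pass."""
    by_image = defaultdict(list)
    for a in annotations:
        if a['attributes']['class'] == 'word':
            by_image[a['image_id']].append(a)
    return by_image

# Hypothetical stand-in records that only carry the keys used in the post.
sample = [
    {'image_id': 'A', 'attributes': {'class': 'word'}},
    {'image_id': 'A', 'attributes': {'class': 'other'}},  # 'other' is a made-up class value
    {'image_id': 'B', 'attributes': {'class': 'word'}},
]
word_index = index_word_annotations(sample)
print(len(word_index['A']), len(word_index['B']))
```

With the real data this would be called as `index_word_annotations(file['annotations'])` and then indexed by `goods[0]['id']`.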

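First-pass processing: splitting the AI Hub data

The images are split into train/validation/test sets, and the annotations are grouped by the image they belong to.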

import random
import os

ocr_good_files = os.listdir('/data/nengcipe/dataset/Goods/')
print(len(ocr_good_files)) # 26340; while processing the data I found some incorrectly tagged entries

#random.shuffle(ocr_good_files)

n_train = int(len(ocr_good_files) * 0.7)
n_validation = int(len(ocr_good_files) * 0.2)
n_test = int(len(ocr_good_files) * 0.1)

print(n_train, n_validation, n_test) #18438 5268 2634
# Split in a 70:20:10 ratio, taking the images in listing order, and store each image's annotation info with it

train_files = ocr_good_files[:n_train]
validation_files = ocr_good_files[n_train: n_train+n_validation]
test_files = ocr_good_files[-n_test:]

## Store the id values of the train/validation/test images
train_img_ids = {}
validation_img_ids = {}
test_img_ids = {}

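# Sets make each membership test O(1) instead of scanning a list of thousands of names
train_set, validation_set, test_set = set(train_files), set(validation_files), set(test_files)

for image in file['images']:
    if image['file_name'] in train_set:
        train_img_ids[image['file_name']] = image['id']
    elif image['file_name'] in validation_set:
        validation_img_ids[image['file_name']] = image['id']
    elif image['file_name'] in test_set:
        test_img_ids[image['file_name']] = image['id']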

print(len(train_img_ids)) # 18440, confirming the ids were stored

## Collect the annotations that belong to the train/validation/test images
train_annotations = {f:[] for f in train_img_ids.keys()}
# Maps every file name in train_img_ids to an empty list, to be filled with that image's annotations
validation_annotations = {f:[] for f in validation_img_ids.keys()}
test_annotations = {f:[] for f in test_img_ids.keys()}

train_ids_img = {train_img_ids[id_]:id_ for id_ in train_img_ids}
# Inverts train_img_ids: the loop variable id_ is a file name here,
# so the resulting dictionary maps image id -> file name
validation_ids_img = {validation_img_ids[id_]:id_ for id_ in validation_img_ids}
test_ids_img = {test_img_ids[id_]:id_ for id_ in test_img_ids}

for idx, annotation in enumerate(file['annotations']):
    if idx % 5000 == 0:
        print(idx,'/',len(file['annotations']),'processed')
    if annotation['attributes']['class'] != 'word':
        continue
    if annotation['image_id'] in train_ids_img:
        train_annotations[train_ids_img[annotation['image_id']]].append(annotation)
    elif annotation['image_id'] in validation_ids_img:
        validation_annotations[validation_ids_img[annotation['image_id']]].append(annotation)
    elif annotation['image_id'] in test_ids_img:
        test_annotations[test_ids_img[annotation['image_id']]].append(annotation)

# Use a separate handle name so the dataset dict in `file` is not overwritten
with open('train_annotation.json', 'w') as out:
    json.dump(train_annotations, out)
with open('validation_annotation.json', 'w') as out:
    json.dump(validation_annotations, out)
with open('test_annotation.json', 'w') as out:
    json.dump(test_annotations, out)
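
The split above takes files in the order returned by `os.listdir`, which Python documents as arbitrary, and leaves the shuffle commented out. A minimal sketch of a reproducible shuffled split, assuming a fixed seed is acceptable; the helper name `split_files`, the seed value, and the sample file names are hypothetical and not part of the original script:

```python
import random

def split_files(names, percents=(70, 20, 10), seed=42):
    """Return (train, validation, test) lists from a seeded shuffle of names."""
    names = sorted(names)               # fix the starting order before shuffling
    random.Random(seed).shuffle(names)  # local RNG, so global random state is untouched
    n_train = len(names) * percents[0] // 100
    n_val = len(names) * percents[1] // 100
    train = names[:n_train]
    val = names[n_train:n_train + n_val]
    test = names[n_train + n_val:]      # remainder, so rounding never drops a file
    return train, val, test

train, val, test = split_files([f'img_{i}.jpg' for i in range(100)])
print(len(train), len(val), len(test))
```

Integer arithmetic is used for the counts to sidestep float truncation, and the test set takes the remainder so the three parts always cover every file exactly once.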

[output]

As a result, the annotation files are created.

⇒ This is what the actual train_annotation file looks like. The JSON was written without any formatting, so I used a validator to pretty-print it.

The JSON Validator
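
As an alternative to an external validator, `json.dump` can write indented output directly. A minimal sketch, assuming readable output matters more than file size; the helper name `dump_pretty`, the output file name, and the sample record (including its `text` key) are hypothetical:

```python
import json
import os
import tempfile

def dump_pretty(obj, path):
    """Write obj as indented UTF-8 JSON, keeping Korean characters unescaped."""
    with open(path, 'w', encoding='utf-8') as out:
        json.dump(obj, out, ensure_ascii=False, indent=2)

# Hypothetical record shaped like {file_name: [annotation, ...]}.
sample = {'img_0.jpg': [{'bbox': [10, 20, 30, 40], 'text': '가나'}]}
path = os.path.join(tempfile.mkdtemp(), 'train_annotation_pretty.json')
dump_pretty(sample, path)

with open(path, encoding='utf-8') as f:
    written = f.read()
print(written)
```

`indent=2` puts each key on its own line, and `ensure_ascii=False` keeps Korean text readable instead of `\uXXXX` escapes.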

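Second-pass processing

Use each image's annotations to crop the word regions according to their 'bbox' coordinates.

Problem: during training I found that some x, y, w, h values are 0 or negative, which led to a debugging session.
Fix: these entries need to be filtered out with exception handling when the data is saved.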
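
One way the filtering described above could look is sketched below. It assumes `bbox` is `[x, y, w, h]` in pixels; the helper name `clip_bbox` and the policy (drop boxes with non-positive width or height, clip the rest to the image bounds) are assumptions of this sketch, not the original implementation. A stricter variant could drop any box with a negative x or y instead of clipping it.

```python
def clip_bbox(bbox, img_w, img_h):
    """Return (x0, y0, x1, y1) clipped to the image, or None if the box is unusable."""
    try:
        x, y, w, h = (int(round(float(v))) for v in bbox)
    except (TypeError, ValueError, OverflowError):
        return None  # missing, malformed, or non-numeric bbox
    if w <= 0 or h <= 0:
        return None  # zero or negative size cannot be cropped
    x0, y0 = max(x, 0), max(y, 0)
    x1, y1 = min(x + w, img_w), min(y + h, img_h)
    if x1 <= x0 or y1 <= y0:
        return None  # box lies entirely outside the image
    return x0, y0, x1, y1

print(clip_bbox([10, 20, 30, 40], 100, 100))
print(clip_bbox([10, 20, 0, 40], 100, 100))
print(clip_bbox([-5, 20, 30, 40], 100, 100))
```

For an image array loaded with `plt.imread`, a valid result would then be cropped as `img[y0:y1, x0:x1]`, since rows correspond to y and columns to x.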

⚡ For the neural network training stage, I use deep-text-recognition-benchmark, the open-source project provided by CLOVA AI.

pip3 install fire
python3 create_lmdb_dataset.py --inputPath data/ --gtFile data/gt.txt --outputPath result/

data
├── gt.txt
└── test
    ├── word_1.png
    ├── word_2.png
    ├── word_3.png
    └── ...

To use this script, the directory has to be rearranged into the layout shown above.
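
As far as I understand the project's README, each line of gt.txt is an image path relative to `--inputPath`, a tab, and the label. A minimal sketch of writing such a file, assuming the word crops were already saved under `test/`; the helper name `write_gt` and the sample labels are hypothetical:

```python
import os
import tempfile

def write_gt(pairs, gt_path):
    """Write one tab-separated 'relative image path, label' line per word image."""
    with open(gt_path, 'w', encoding='utf-8') as out:
        for rel_path, label in pairs:
            # Skip labels that are empty or would break the tab/newline-delimited format.
            if not label or '\t' in label or '\n' in label:
                continue
            out.write(f'{rel_path}\t{label}\n')

# Hypothetical (path, label) pairs; the second has an empty label and is skipped.
pairs = [('test/word_1.png', '우유'), ('test/word_2.png', ''), ('test/word_3.png', '설탕')]
data_dir = tempfile.mkdtemp()
gt_path = os.path.join(data_dir, 'gt.txt')
write_gt(pairs, gt_path)

with open(gt_path, encoding='utf-8') as f:
    gt_lines = f.read().splitlines()
print(gt_lines)
```

In the real pipeline `data_dir` would be the `data/` directory passed as `--inputPath`, with `gt.txt` passed as `--gtFile`.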

